“A must-read for anyone wishing to dig into AI without getting lost in the weeds. Kneusel has succeeded in explaining how AI works to a layperson like myself.”
—KENNETH GASS, HONORARY CURATOR OF GEOLOGY, MILWAUKEE PUBLIC MUSEUM
“How AI Works is a friendly and personal peek behind the curtain of modern AI. Ronald T. Kneusel tells the story of how the field grew, and surveys the ideas that are powering the AI revolution. From this book, you’ll learn not only how AI works today, but its limits, its capabilities, and where it might take us tomorrow.”
—ANDREW GLASSNER, AUTHOR OF DEEP LEARNING: A VISUAL APPROACH
“How AI Works is a tour de force of the rich history of artificial intelligence, from the early perceptrons and symbolic systems to large language models such as ChatGPT. For beginners, it demystifies AI and is a perfect resource to get up to date with more than six decades of research and development. For those versed in AI, it serves as an invaluable tool to fill knowledge gaps. Even AI experts will gain a fresh perspective, enhancing their understanding and ability to articulate complex concepts.”
—BEN DICKSON, SOFTWARE ENGINEER, EDITOR OF TECHTALKS
“After reading this book I have a better understanding of the ML tools I have already used in my work, and a new appreciation of and insight into how large language models, and future AI, will likely change the domains in which I work. I recommend this book to anyone who works with software systems, including management, and anyone who just wants to know what AI actually does under the hood.”
—DANIEL KOSEY, CISSP, CYBERSECURITY ENGINEER
From Sorcery to Science
San Francisco
HOW AI WORKS. Copyright © 2024 by Ronald T. Kneusel.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.
First printing
27 26 25 24 23 1 2 3 4 5
ISBN-13: 978-1-7185-0372-4 (print)
ISBN-13: 978-1-7185-0373-1 (ebook)
Publisher: William Pollock
Managing Editor: Jill Franklin
Production Manager: Sabrina Plomitallo-González
Production Editor: Miles Bond
Developmental Editor: Eva Morrow
Cover Illustrator: Gina Redman
Interior Design: Octopod Studios
Technical Reviewer: Alex Kachurin
Copyeditor: Rachel Head
Proofreader: Carl Quesnel
For information on distribution, bulk sales, corporate sales, or translations, please contact No Starch Press® directly at info@nostarch.com or:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900
www.nostarch.com
Library of Congress Control Number: 2023038565
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
To Frank Rosenblatt—he saw it coming.
Alex Kachurin is a data science and machine learning professional with more than 15 years of experience in the field. He earned an MS in computer vision from the University of Central Florida in 2010.
Chapter 1: And Away We Go: An AI Overview
Chapter 2: Why Now? A History of AI
Chapter 3: Classical Models: Old-School Machine Learning
Chapter 4: Neural Networks: Brain-Like AI
Chapter 5: Convolutional Neural Networks: AI Learns to See
Chapter 6: Generative AI: AI Gets Creative
Chapter 7: Large Language Models: True AI at Last?
Chapter 8: Musings: The Implications of AI
Thanks, first and foremost, to Eva Morrow for her gentle (and kind) editing. Thanks also to Alex Kachurin, MS, for his insights, thoughtful comments, and suggestions. Finally, I want to thank all the good folks at No Starch Press for believing in the book and helping to make it a reality.
Many books teach you how to do artificial intelligence (AI). Similarly, many popular books tell you about AI. However, what seems to be missing is a book that teaches you how AI works at a conceptual level. AI isn’t magic; you can understand what it’s doing without burying yourself in complex mathematics.
This book fills that void with a math-free explanation of how AI works. While some books are down in the weeds and others offer a bird’s-eye view, this book is at treetop level. It aims to provide you with enough detail to understand the approach without getting bogged down in nitty-gritty mathematics. If that piques your interest, I invite you to read on.
You’ll run across places where **** appears throughout the book. These markers highlight a shift in the topic or a transition point. In a textbook, **** would indicate a new section, but this isn’t a textbook, nor do I want it to feel like one; so, instead of sections and subsections, I’ll use asterisks to warn you that a change is coming. Like this . . .
****
I first learned about artificial intelligence in 1987, in an undergraduate course of the same name. What people typically mean by AI has changed somewhat over the intervening decades. Still, the goal remains the same: to mimic intelligent behavior in a machine.
Few people in the 1980s had any reason to learn about AI, if they were even aware of it. AI had minimal impact on their daily lives, beyond the occasional renegade computer in science fiction TV shows and movies like Star Trek or WarGames, to say nothing of the relentless and terrifying Terminator.
However, the 1980s are long gone, current retro fashion trends notwithstanding, and AI is everywhere. It affects our lives in numerous ways every day, from phones telling us to drive here and not there, to labeling friends and family in pictures, to the articles and ads fed to us continuously online, like it or not. And this is to say nothing of the recent AI explosion involving large language models, which many interpret as “true AI” at last.
AI is also there behind the scenes in ways we seldom realize: airline flight planning, shipping and logistics, factory automation, satellite imaging of the earth, and helping your doctor decide if that lump is cancer, to name a few.
Why learn about AI now?
This book answers that question by explaining what happened, when it happened, why it happened, and, most importantly, how it happened—all without hype or a single mathematical equation. Frankly, the reality behind the AI revolution is impressive enough; the hype is unnecessary.
At this point, I feel some words about me are in order. After all, I’m asking you to join me on a journey through the world of AI, so it’s reasonable to wonder about your guide. I certainly would.
As mentioned earlier, I was introduced to AI in the late 1980s. I began working in AI, in the subfield known as machine learning, in 2003, applying machine learning models to intravascular ultrasound images.
I first heard of deep learning in 2010. Deep learning is a subfield of machine learning. I’ll clarify the difference between deep learning, machine learning, and artificial intelligence in Chapter 1, but for now you can think of them as the same thing.
In 2012, AI burst onto the scene—or at least into the news—with the advent of what came to be called AlexNet and a curious experiment at Google involving computers that learned to identify cats in YouTube videos. I was in the room at the 2012 International Conference on Machine Learning in Edinburgh, Scotland, when Google presented its paper. It was standing room only for the conference’s 800 or so attendees.
In 2016, I completed a PhD in computer science specializing in AI at the University of Colorado, Boulder, under the direction of Michael Mozer. I’ve worked in AI daily since then, primarily in the defense industry, with a short break in 2016 to help co-found a medical AI startup.
After AlexNet, things changed quickly: some new AI-related “miracle” seemed to appear monthly in the academic literature, if not on the evening news. The only way to keep up was to attend conferences multiple times per year; waiting for results to appear in an academic journal was pointless, as the field was progressing too rapidly for the typically slow pace of academic publishing.
I’m writing this preface in November 2022 at the NeurIPS conference. NeurIPS is arguably the premier AI conference (no hate emails, please!), and this is the first time it’s been held in person since the COVID-19 pandemic. Attendance is high, though perhaps not as high as at the 2019 conference, for which a lottery was held to determine which 13,500 people could attend. The fact that conference attendance has blossomed from a few hundred to over 10,000 in a decade tells us how important AI research has become.
The names of the tech industry leaders who support these conferences, which are prime hunting grounds for graduate students, also reveal the significance of AI. You’ll find expo booths for Google, DeepMind (also Google), Meta (read: Facebook), Amazon, Apple, and others. AI drives much of what these companies do. AI is big bucks. AI runs on data, and these companies gobble up all the data we freely give them in exchange for their services.
By the end of the book, you’ll understand what AI is doing under the hood (or bonnet, if you prefer). Ultimately, it isn’t all that difficult to comprehend, though the devil is definitely in the details.
The book proceeds as follows:
Chapter 1, And Away We Go: An AI Overview We dive in with a quick overview of AI essentials and a basic example.
Chapter 2, Why Now? A History of AI AI didn’t just fall from the sky. This chapter gives you AI’s backstory and clarifies why the revolution is happening now.
Chapter 3, Classical Models: Old-School Machine Learning Modern AI is all neural networks, but to understand what neural networks are doing, it helps to understand the models that came before.
Chapter 4, Neural Networks: Brain-Like AI If you want to know what a neural network is, how it’s trained, and how it’s used, then this chapter is for you.
Chapter 5, Convolutional Neural Networks: AI Learns to See Much of the power of modern AI comes from learning new ways to represent data. If that sentence has no meaning for you, this chapter will help.
Chapter 6, Generative AI: AI Gets Creative Traditional supervised machine learning models attach labels to inputs. Generative AI produces novel output, including text, images, and even video. This chapter explores two popular approaches: generative adversarial networks (GANs) and diffusion models. GANs provide the intuition we need to explore diffusion models and, in Chapter 7, large language models (LLMs). Diffusion models are adept at producing detailed, photorealistic images and videos from text prompts.
Chapter 7, Large Language Models: True AI at Last? OpenAI’s fall 2022 release of its large language model, ChatGPT, might very well have ushered in the era of true AI. This chapter explores LLMs: what they are, how they work, and the claim that they are something new and disruptive.
Chapter 8, Musings: The Implications of AI The advent of large language models has altered the AI landscape. This chapter muses on the implications.
At the end of the book, you’ll find a collection of additional resources to explore, should the AI bug bite and you want to learn more. Personally, and admittedly with bias, I recommend my books Practical Deep Learning: A Python-Based Introduction (2021) and Math for Deep Learning: What You Need to Know to Understand Neural Networks (2021), both available from No Starch Press. They will give you what you need to go from reading about how AI works conceptually to “doing” AI.
Finally, as you read, you’ll notice that specific phrases in the text are emphasized. Definitions for many of these emphasized words and phrases are found in the glossary at the end of the book. Like every field, AI has its jargon. Keeping all the terms in your head is burdensome, hence the glossary to help you remember them.
I’m a real person. I know because I can successfully identify and click images of trains and traffic lights. If you have comments or questions about the material in this book, I want to hear from you. Please email me at rkneuselbooks@gmail.com.
Now, if you’re ready, away we go.
Artificial intelligence attempts to coax a machine, typically a computer, to behave in ways humans judge to be intelligent. The phrase was coined in the 1950s by prominent computer scientist John McCarthy (1927–2011).
This chapter aims to clarify what AI is and its relationship to machine learning and deep learning, two terms you may have heard in recent years. We’ll dive in with an example of machine learning in action. Think of this chapter as an overview of AI as a whole. Later chapters will build on and review the concepts introduced here.
****
Computers are programmed to carry out a particular task by giving them a sequence of instructions, a program, which embodies an algorithm: the recipe the program causes the computer to execute.
The word algorithm is cast about often these days, though it isn’t new; it’s a corruption of Al-Khwarizmi, referring to ninth-century Persian mathematician Muhammad ibn Musa al-Khwarizmi, whose primary gift to the world was the mathematics we call algebra.
****
Let’s begin with a story.
Tonya owns a successful hot sauce factory. The hot sauce recipe is Tonya’s own, and she guards it carefully. It’s literally her secret sauce, and only she understands the process of making it.
Tonya employs one worker for each step of the hot sauce–making process. These are human workers, but Tonya treats them as if they were machines because she’s worried they’ll steal her hot sauce recipe—and because Tonya is a bit of a monster. In truth, the workers don’t mind much because she pays them well, and they laugh at her behind her back.
Tonya’s recipe is an algorithm; it’s the set of steps that must be followed to create the hot sauce. The collection of instructions Tonya uses to tell her workers how to make the hot sauce is a program. The program embodies the algorithm in a way that the workers (the machine) can follow step by step. Tonya has programmed her workers to implement her algorithm to create hot sauce. The sequence looks something like this:
There are a few things to note about this scenario. First, Tonya is definitely a monster for treating human beings as machines. Second, at no point in the process of making hot sauce does any worker need to understand why they do what they do. Third, the programmer (Tonya) knows why the machine (the workers) does what it does, even if the machine doesn’t.
****
What I’ve just described is how we’ve controlled virtually all computers, going back to the first conceptual machines envisioned by Alan Turing in the 1930s and even earlier to the 19th-century Analytical Engine of Charles Babbage. A human conceives an algorithm, then translates that algorithm into a sequence of steps (a program). The machine executes the program, thereby implementing the algorithm. The machine doesn’t understand what it’s doing; it’s simply performing a series of primitive instructions.
The genius of Babbage and Turing lay in the realization that there could be a general-purpose machine capable of executing arbitrary algorithms via programs. However, I would argue that it was Ada Lovelace, a friend of Babbage’s often regarded as the world’s first programmer, who initially understood the far-reaching possibilities of what we now call a computer. We’ll talk more about Turing, Babbage, and Lovelace in Chapter 2.
NOTE
In Lovelace’s day, a “computer” was not a machine but a human being who calculated by hand. Hence, Babbage’s Engine was a mechanical computer.
Let’s take a moment to explore the relationship between the terms AI, machine learning, and deep learning. On the one hand, all three have become synonymous as referring to modern AI. This is wrong, but convenient. Figure 1-1 shows the proper relationship between the terms.
Figure 1-1: The relationship between artificial intelligence, machine learning, and deep learning
Deep learning is a subfield of machine learning, which is a subfield of artificial intelligence. This relationship implies that AI involves concepts that are neither machine learning nor deep learning. We’ll call those concepts old-school AI, which includes the algorithms and approaches developed from the 1950s onward. Old-school AI is not what people currently mean when discussing AI. Going forward, we’ll entirely (and unfairly) ignore this portion of the AI universe.
Machine learning builds models from data. For us, a model is an abstract notion of something that accepts inputs and generates outputs, where the inputs and outputs are related in some meaningful way. The primary goal of machine learning is to condition a model using known data so that the model produces meaningful output when given unknown data. That’s about as clear as muddy water, but bear with me; the mud will settle in time.
Deep learning uses large models of the kind that were previously too big to be useful. More muddy water, but I’m going to argue that there’s no strict definition of deep learning other than that it involves neural networks with many layers. Chapter 4 will clarify.
In this book, we’ll be sloppy but in accord with popular usage, even by experts, and take “deep learning” to mean large neural networks (yet to be formally defined), “machine learning” to mean models conditioned by data, and “AI” to be a catchall for both machine learning and deep learning—remembering that there is more to AI than what we discuss here.
Data is everything in AI. I can’t emphasize this enough. Models are blank slates that data must condition to make them suitable for a task. If the data is bad, the model is bad. Throughout the book, we’ll return to this notion of “good” and “bad” data.
For now, let’s focus on what a model is, how it’s made useful by conditioning, and how it’s used after conditioning. All this talk of conditioning and using sounds dark and sinister, if not altogether evil, but, I assure you, it’s not, even though we have ways of making the model talk.
****
A machine learning model is a black box that accepts an input, usually a collection of numbers, and produces an output, typically a label like “dog” or “cat,” or a continuous value like the probability of being a “dog” or the value of a house with the characteristics given to the model (size, number of bathrooms, ZIP code, and so on).
The model has parameters, which control the model’s output. Conditioning a model, known as training, seeks to set the model’s parameters in such a way that they produce the correct output for a given input.
Training implies that we have a collection of inputs, and the outputs the model should produce when given those inputs. At first blush, this seems a bit silly; why do we want the model to give us an output we already have? The answer is that we will, at some future point, have inputs for which we don’t already have the output. This is the entire point of making the model: to use it with unknown inputs and to believe the model when it gives us an output.
Training uses the collection of known inputs and outputs to adjust the model’s parameters to minimize mistakes. If we can do that, we begin to believe the model’s outputs when given new, unknown inputs.
Training a model is fundamentally different from programming. In programming, we implement the algorithm we want by instructing the computer step by step. In training, we use data to teach the model to adjust its parameters to produce correct output. There is no programming because, most of the time, we have no idea what the algorithm should be. We only know or believe a relationship exists between the inputs and the desired outputs. We hope a model can approximate that relationship well enough to be useful.
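To make the contrast concrete, here is a minimal sketch of training in code. Nothing here comes from the book itself: the model, a single threshold on petal length, and the measurements are illustrative stand-ins. The point is that no one writes the classification rule by hand; the data sets the parameter.

```python
# "Training" versus "programming": instead of hand-coding a rule, we adjust
# one parameter (a threshold on petal length) so the model makes the fewest
# mistakes on labeled data. All measurements here are illustrative.

# (petal length in cm, class), class 0 = I. setosa, class 1 = I. versicolor
training_data = [(1.4, 0), (1.3, 0), (1.5, 0), (1.7, 0),
                 (4.7, 1), (4.5, 1), (4.0, 1), (3.3, 1)]

def predict(length, threshold):
    """The model: a single parameter, the threshold."""
    return 1 if length >= threshold else 0

def mistakes(threshold):
    """Count how many training samples the model gets wrong."""
    return sum(predict(x, threshold) != label for x, label in training_data)

# "Training": try candidate thresholds, keep the one with the fewest mistakes.
candidates = [t / 10 for t in range(10, 60)]   # 1.0 cm through 5.9 cm
best = min(candidates, key=mistakes)

print(best, mistakes(best))  # a threshold between the two species, 0 mistakes
```

No step of this code encodes *why* a particular threshold separates the species; the parameter simply ends up wherever the labeled data pushes it, which is the sense in which the machine is taught rather than programmed.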
It’s worth remembering the sage words of British statistician George Box, who said that all models are wrong, but some are useful. At the time, he was referring to other kinds of mathematical models, but the wisdom applies to machine learning.
Now we understand why the field is called machine learning: we teach the machine (model) by giving it data. We don’t program the machine; we instruct it.
Here, then, is the machine learning algorithm:

1. Gather a collection of inputs along with the outputs the model should produce for them (the labeled training data).
2. Train the model by adjusting its parameters to minimize the mistakes it makes on the training data.
3. Use the trained model on new, unknown inputs and accept the outputs it produces.
Most of machine learning follows this algorithm. Since we’re using known labeled data to train the model, this approach is called supervised learning: we supervise the model while it learns to produce correct output. In a sense, we punish the model until it gets it right. This is a dark enterprise, after all.
We’re ready for an example, but let’s first summarize the story so far. We want a system where, for an unknown input, we get a meaningful output. To make the system, we train a machine learning model using a collection of inputs and their known outputs. Training conditions the model by modifying its parameters to minimize the mistakes it makes on the training data. When we’re satisfied with the model’s performance, we use the model with unknown inputs because we now believe the model when it gives us an output (at least, most of the time).
Our first example comes from a famous dataset consisting of measurements of the parts of iris flowers. This dataset is from the 1930s, indicating how long people have contemplated what we now call machine learning.
The goal is a model that, for an input collection of measurements, outputs the specific species of iris flower. The full dataset has four measurements for three iris species. We’ll keep it simple and use two measurements and two species: petal length and width in centimeters (cm) for I. setosa versus I. versicolor. Therefore, we want the model to accept two measurements as input and give us an output we can interpret as I. setosa or I. versicolor. Binary models like this decide between two possible outputs and are common in AI. If the model decides between more than two categories, it’s a multiclass model.
We have 100 samples in our dataset: 100 pairs of petal measurements, and the corresponding iris flower types. We’ll call I. setosa class 0 and I. versicolor class 1, where class labels the input categories.
Models often want numeric class labels, which tells us that models don’t know what their inputs and outputs mean; they only make associations between sets of inputs and outputs. Models don’t “think” using any commonly accepted definition of the word. (The models of Chapter 7 might beg to differ, but more on that then.)
****
Here we must pause to introduce some critical terminology. I know, not what you want to read, but it’s essential to all that follows. Artificial intelligence makes frequent use of vectors and matrices (singular “matrix”). A vector is a string of numbers treated as a single entity. For example, the four measurements of each iris flower mean we can represent the flower as a string of four numbers, say, (4.5, 2.3, 1.3, 0.3). The flower described by this vector has a sepal length of 4.5 cm, sepal width of 2.3 cm, petal length of 1.3 cm, and petal width of 0.3 cm. By grouping these measurements together, we can refer to them as a single entity.
The number of elements in a vector determines its dimensionality; for example, the iris dataset uses four-dimensional vectors, the four measurements of the flower. AI often works with inputs that have hundreds or even thousands of dimensions. If the input is an image, every pixel of that image is one dimension, meaning a small 28-pixel-square image becomes an input vector of 28×28, or 784 dimensions. The concept is the same in 3 dimensions or 33,000 dimensions: it remains a string of numbers treated as a single entity. But an image has rows and columns, making it a two-dimensional array of numbers, not a string. Two-dimensional arrays of numbers are matrices. In machine learning, we often represent datasets as matrices, where the rows are vectors representing the elements of the dataset, like an iris flower, and the columns are the measurements. For example, the first five flowers in the iris dataset form the following matrix:
Each row is a flower. Notice that the first row matches the vector example. The remaining rows list the measurements for other flowers.
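These ideas map directly onto NumPy arrays. A quick sketch: the first row of the matrix is the example vector from the text, while the other rows are made-up measurements added only for illustration.

```python
import numpy as np

# A single iris flower as a four-dimensional feature vector:
# (sepal length, sepal width, petal length, petal width), in cm.
flower = np.array([4.5, 2.3, 1.3, 0.3])

# A dataset as a matrix: rows are flowers (vectors), columns are
# measurements. Only the first row comes from the text; the rest
# are illustrative values.
dataset = np.array([
    [4.5, 2.3, 1.3, 0.3],
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
])

print(flower.shape)   # (4,)   -- a 4-dimensional vector
print(dataset.shape)  # (3, 4) -- 3 flowers, 4 measurements each
```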
While you’re reading, keep these ideas about vectors and matrices in the back of your mind.
As we continue our exploration of AI, the differences between vectors and matrices will come into focus. Now, let’s return to our story.
The inputs to a model are its features. Our iris flower dataset has two features, the petal’s length and width, which are grouped into feature vectors (or samples). A single feature vector serves as the model’s input. A binary model’s output is typically a number relating to the model’s belief that the input belongs to class 1. For our example, we’ll give the model a feature vector consisting of two features and expect an output that lets us decide whether we should call the input I. versicolor. If not, we declare the input to be I. setosa because we assume that inputs will always be one or the other.
Machine learning etiquette states that we should test our model; otherwise, how will we know it’s working? You might think it’s working when it gets all the training samples right, but experience has taught practitioners this isn’t always the case. The proper way to test a model is to keep some of the labeled training data to use after training. The model’s performance on this held-out test dataset better indicates how well the model has learned. We’ll use 80 of the 100 labeled samples for training and keep the remaining 20 for testing, making sure that both the training and test sets contain an approximately even mix of both classes (flower types). This is also essential in practice, as far as possible. If we never show the model examples of a particular class of input, how can it learn to distinguish that class from others?
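A split like this can be sketched with scikit-learn’s `train_test_split`. The feature values below are random placeholders standing in for the real petal measurements; only the 80/20 split and the even class mix mirror the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder stand-in for the iris data: 100 two-feature samples
# (petal length, petal width) with class labels 0 and 1.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = np.array([0] * 50 + [1] * 50)

# Hold out 20 of the 100 labeled samples for testing.
# stratify=y keeps an even mix of both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0)

print(len(X_train), len(X_test))  # 80 20
```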
Using a held-out test set to judge the performance of a model isn’t just etiquette. It addresses a foundational issue in machine learning: generalization. Some machine learning models follow a process quite similar to a widely used approach known as optimization. Scientists and engineers use optimization to fit measured data to known functions; machine learning models also use optimization to condition their parameters, but the goal is different. Fitting data to a function, like a line, seeks to create the best possible fit, or the line that best explains the measured data. In machine learning, we instead want a model that learns the general characteristics of the training data to generalize to new data. That’s why we evaluate the model with the held-out test set. To the model, the test set contains new, unseen data it didn’t use to modify its parameters. The model’s performance on the test set is a clue to its generalization abilities.
Our example has two input features, meaning the feature vectors are two-dimensional. Since we have two dimensions, we can opt to make a plot of the training dataset. (If we have two or three features in a feature vector, we can plot the feature vectors. However, most feature vectors have hundreds to thousands of features. I don’t know about you, but I can’t visualize a thousand-dimensional space.)
Figure 1-2 displays the two-dimensional iris training data; the x-axis is petal length, and the y-axis is petal width. The circles correspond to instances of I. setosa and the squares I. versicolor. Each circle or square represents a single training sample, the petal length and width for a specific flower. To place each point, find the petal length on the x-axis and the petal width on the y-axis. Then, move up from the x-axis and to the right from the y-axis. Where your fingers meet is the point representing that flower. If the flower is I. setosa, make the point a circle; otherwise, make it a square.
Figure 1-2: The iris training data
The plot in Figure 1-2 shows the feature space of the training set. In this case, we can visualize the training set directly, because we only have two features. When that’s not possible, all is not lost. Advanced algorithms exist that allow us to make plots like Figure 1-2 where the points in two or three dimensions reflect the distribution of the samples in the much higher-dimensional space. Here, the word space means much the same as it does in everyday parlance.
Look carefully at Figure 1-2. Does anything jump out at you? Are the different classes mixed or well separated? Every circle inhabits the lower-left corner of the plot, while all of the squares are in the upper right. There is no overlap between the classes, meaning they are entirely separate in the feature space.
How can we use this fact to make a classifier, a model that classifies iris flowers? (While model is the more general term, since not all models place their inputs into categories, the term classifier applies when they do.)
We have many model types to choose from for our classifier, including decision trees, which generate a series of yes/no questions related to the features used to decide the class label to output for a given input. When the questions are laid out visually, they form a structure reminiscent of an upside-down tree. Think of a decision tree as a computer-generated version of the game 20 Questions.
Even though we have two features, petal length and petal width, we can classify new iris flowers by asking a single question: is the petal length less than 2.5 cm? If the answer is “yes,” then return class 0, I. setosa; otherwise, return class 1, I. versicolor. To classify the training data correctly, we need only the answer to this simple question.
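The single-question classifier amounts to a few lines of Python. The function name and signature are mine, invented for illustration; the 2.5 cm threshold is the one from the text.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Single-question classifier: class 0 is I. setosa,
    class 1 is I. versicolor."""
    # Petal width is accepted as input, but one question about
    # petal length is enough to separate these two classes.
    return 0 if petal_length_cm < 2.5 else 1

print(classify_iris(1.3, 0.3))  # 0 (I. setosa)
print(classify_iris(4.7, 1.4))  # 1 (I. versicolor)
```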
Did you catch what I did just now? I said that the question correctly classifies all the training data. What about the 20 test samples we didn’t use? Is our single-question classifier sufficient to give each of them the correct label? In practice, that’s what we want to know, and that is what we would report as the classifier’s performance.
Figure 1-3 shows the training data again, along with the test data we didn’t use to make our single-question classifier. The solid circles and squares represent the test data.
Figure 1-3: The iris training data with the held-out test data (solid)
None of the test data violates our rule; we still get correct class labels by asking if the petal length is less than 2.5 cm. Therefore, our model is perfect; it makes no mistakes. Congratulations, you just created your first machine learning model!
We should be happy, but not too happy. Let’s repeat this exercise, replacing I. setosa with the remaining iris species, I. virginica. This leads to Figure 1-4, where the triangles are instances of I. virginica.
Figure 1-4: The new iris training data
Hmm, things are not as clear-cut now. The obvious gap between the classes is gone, and they overlap.
I trained a decision tree using this new iris dataset. As before, there were 80 samples for training and 20 held back for testing. This time, the model wasn’t perfect. It correctly labeled 18 of the 20 samples, for an accuracy of 9 out of 10, or 90 percent. This roughly means that when this model assigns a flower to a particular class, there is a 90 percent chance it’s correct. The previous sentence, to be rigorous, needs careful clarification, but for now, you get the idea—machine learning models are not always perfect; they (quite frequently) make mistakes.
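A sketch of this experiment using scikit-learn’s bundled copy of the iris data. The random split here is not the one I used, so the accuracy printed won’t necessarily reproduce the 90 percent figure; it illustrates the procedure, not the exact result.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Keep only I. versicolor (label 1) and I. virginica (label 2),
# and only the petal length and width features (columns 2 and 3).
iris = load_iris()
keep = iris.target > 0
X = iris.data[keep][:, 2:]
y = iris.target[keep]

# 80 samples to train, 20 held out to test, evenly mixed classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.2f}")
```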
Figure 1-5 shows the learned decision tree. Begin at the top, which is the root, and answer the question in that box. If the answer is “yes,” move to the box on the left; otherwise, move to the right. Keep answering and moving in this way until you arrive at a leaf: a box with no arrows. The label in this box is assigned to the input.
Figure 1-5: The decision tree for I. virginica versus I. versicolor
The first decision tree classifier was trivial, as the answer to a single question was sufficient to decide class membership. The second decision tree classifier is more common. Most machine learning models are not particularly simple. Though their operation is comprehensible, understanding why they act as they do is an entirely different matter. Decision trees are among the few model types that readily explain themselves. For any input, the path traversed from root to leaf in Figure 1-5 explains in detail why the input received a particular label. The neural networks behind modern AI are not so transparent.
For a model to perform well “in the wild,” meaning when used in the real world, the data used to train the model must cover the entire range of inputs that the model might encounter. For example, say we want a model to identify pictures of dogs, and our training set contains images of only dogs and parrots. While the model performs well on our held-out test set, which also includes pictures of dogs and parrots, what might happen when we deploy the model and it comes across a picture of a wolf? Intuitively, we might expect the model to say “it’s a dog,” just as a small child might before they learn what a wolf is. This is precisely what most machine learning models would do.
To illustrate this, let’s try an experiment. A popular dataset used by all AI researchers consists of tens of thousands of small images containing handwritten digits, 0 through 9. It goes by the uninspiring name of MNIST (Modified NIST) because it was derived in the late 1990s from a dataset constructed by the National Institute of Standards and Technology (NIST), the division of the United States Department of Commerce tasked with implementing all manner of standards for just about everything in the commercial and industrial realm.
Figure 1-6 presents some typical MNIST digit samples. Our goal is to build a neural network that learns to identify the digits 0, 1, 3, and 9. We can train neural networks without knowing how they work because of powerful, open source toolkits like scikit-learn that are available to everyone. On the one hand, this democratizes AI; on the other, a little knowledge is often a dangerous thing. Models may appear to be good when they’re flawed in reality, and lack of knowledge about how the models work might prevent us from realizing that fact before it’s too late.
Figure 1-6: Sample MNIST digits
After the classifier is trained, we’ll throw it a few curveballs by handing it images of fours and sevens—inputs the AI never saw during training. What might the model do with such inputs?
I trained the digits model using an open source toolkit. For now, all we need to know about the dataset is that the input feature vectors are unraveled digit images; the first row of pixels is followed by the second row, then the third row, and so on, until the entire image is unraveled into one long vector, a string of numbers. The digit images are 28×28 pixels, making the feature vector 784 numbers long. We’re asking the neural network to learn about things in a 784-dimensional space, rather than the simple 2-dimensional space we used previously, but machine learning is up to the challenge.
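The unraveling step looks like this in NumPy, with a random image standing in for a real MNIST digit:

```python
import numpy as np

# A 28x28 grayscale image (random pixel values as a stand-in
# for an actual MNIST digit).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Unravel the image row by row into one long feature vector.
feature_vector = image.reshape(-1)

print(feature_vector.shape)  # (784,)
# The first 28 entries are the image's first row of pixels.
print(np.array_equal(feature_vector[:28], image[0]))  # True
```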
The training set used to condition the neural network model contained 24,745 samples, roughly 6,000 of each digit type (0, 1, 3, and 9). This is likely enough to fairly represent the types of digits the model might encounter when used, but we need to try it to know. AI is a largely empirical science.
The held-out test set, also containing the digits 0, 1, 3, and 9, had 4,134 samples (about 1,000 for each digit).
We’ll use a confusion matrix, a two-dimensional table of numbers, to evaluate the model. Confusion matrices are the most common way to evaluate a model because they show how it behaves on the test data.
In this case, the confusion matrix for our digit classifier is shown in Table 1-1.
Table 1-1: The Digit Classifier Confusion Matrix
| | 0 | 1 | 3 | 9 |
|---|---|---|---|---|
| 0 | 978 | 0 | 1 | 1 |
| 1 | 2 | 1,128 | 3 | 2 |
| 3 | 5 | 0 | 997 | 8 |
| 9 | 5 | 1 | 8 | 995 |
The matrix rows represent the true labels for the samples given to the model. The columns are the model’s responses. The values in the table are counts, the number of times each possible combination of input class and model-assigned label happened.
For example, the first row represents the zeros in the test set. Of those 980 inputs, the model returned a label of zero for 978 of them, but it said the input was a three once and a nine another time. Therefore, when zero was the input, the model’s output was correct 978 out of 980 times. That’s encouraging.
Similarly, when the input was a one, the model returned the correct label 1,128 times. It was right 997 times for threes and 995 times for nines. When a classifier is good, the numbers along the diagonal of the confusion matrix from upper left to lower right are high, and there are almost no numbers off that diagonal. Off-diagonal numbers are errors made by the model.
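Reading the overall accuracy off a confusion matrix is a one-liner: sum the diagonal and divide by the total count. The counts below are copied from Table 1-1.

```python
import numpy as np

# The digit classifier's confusion matrix from Table 1-1:
# rows are true labels (0, 1, 3, 9), columns are model outputs.
cm = np.array([
    [978,    0,   1,   1],
    [  2, 1128,   3,   2],
    [  5,    0, 997,   8],
    [  5,    1,   8, 995],
])

correct = np.trace(cm)  # the diagonal: correctly labeled samples
total = cm.sum()        # all 4,134 test samples
print(f"accuracy: {correct / total:.3f}")  # accuracy: 0.991
```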
Overall, the digits model is 99 percent accurate. We have a solid, well-performing model—that is, if we can ensure that all inputs to the model are indeed a 0, 1, 3, or 9. But what if they aren’t?
I handed the model 982 fours. The model replied like this:
| 0 | 1 | 3 | 9 |
|---|---|---|---|
| 48 | 9 | 8 | 917 |
In other words, the model returned a label of 9 for 917 of the 982 fours, a label of 0 for 48 of them, and labels of 1 or 3 for the rest. How about sevens?
| 0 | 1 | 3 | 9 |
|---|---|---|---|
| 19 | 20 | 227 | 762 |
The model still favored calling sevens nines, but it often called them threes as well. Neural networks are loath to give up their secrets when explaining their actions, but in this case, of the 227 sevens labeled as threes, 47 of them were European-style sevens with a slash. A random sample of 227 sevens from the entire dataset turned up only 24 European-style sevens. The comparison isn’t rigorous mathematically, but it hints that sevens with a slash are often close enough to a three to fool the model.
The model was never taught to recognize fours or sevens, so it did the next best thing and placed them in a nearby category. Depending on how they’re written, people might sometimes confuse fours and sevens for nines, for example. The model is making the kind of mistakes people make, which is interesting—but, more significantly, the model is poor because it wasn’t trained on the full range of inputs it might encounter. It has no way of saying “I don’t know,” and getting a model to reliably say this can be tricky.
This is a simple exercise, but the implications are profound. Instead of digits, what if the model was looking for cancer in medical images but was never trained to recognize an important category of lesion or all the forms that lesion might take? A properly constructed and comprehensive dataset might mean the difference between life and death.
We can also think about the digits example in terms of interpolation and extrapolation. Interpolation approximates within the range of known data, and extrapolation goes beyond known data.
For the digits example, interpolation might refer to encountering a tilted zero in the wild when none of the zeros in the training set were particularly tilted. The model must interpolate, in a sense, to respond correctly. Extrapolation is more like classifying a zero with a slash through it, which is something unseen during training time. To better understand these terms, let’s model the world population from 1950 through 2020.
First, we’ll fit a line to the data from 1950 through 1970. Fitting a line is a form of curve fitting; think of it as machine learning’s less sophisticated cousin. To fit a line, find two numbers: the slope and the intercept. The slope tells us how steep the line is. If the slope is positive, the line is increasing as we move from left to right along the x-axis of a graph. A negative slope means the line decreases as we move along the x-axis. The intercept is where the line intersects the y-axis; that is, the value when the input is zero.
To fit a line, we use an algorithm to find the slope and intercept that best characterize the data (here, world population from 1950 through 1970). Figure 1-7 shows a plot of the line and the actual populations by year, denoted by plus signs. The line passes through or near to most of the plus signs, so the fit is reasonable. Notice that the population is in billions.
Figure 1-7: World population from 1950 through 1970
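A line fit like the one in Figure 1-7 can be sketched with NumPy’s `polyfit`. The five population values below are approximate figures chosen only to illustrate interpolation versus extrapolation; they are not the full yearly dataset behind the figures.

```python
import numpy as np

# World population in billions for a few years, 1950-1970
# (approximate values, for illustration only).
years = np.array([1950, 1955, 1960, 1965, 1970])
pop = np.array([2.53, 2.77, 3.03, 3.34, 3.70])

# Fit a line: find the slope and intercept that best fit the data.
slope, intercept = np.polyfit(years, pop, deg=1)

def estimate(year):
    """Use the fitted line to estimate the population for a year."""
    return slope * year + intercept

# Inside the fit range (interpolation) vs. beyond it (extrapolation).
print(f"1960 estimate (interpolation): {estimate(1960):.2f}")
print(f"2020 estimate (extrapolation): {estimate(2020):.2f}")
```

As in the text, the extrapolated 2020 estimate falls well short of the actual population of 7.79 billion, because population growth isn’t linear.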
Once we have the line, we can use the slope and intercept to estimate the population for any year. Estimating for years between 1950 and 1970 is interpolating, because we used data from that range of years to create the line. If we estimate populations for years before 1950 or after 1970, we are extrapolating. Table 1-2 shows our results when interpolating.
Table 1-2: Interpolating the Population Between 1950 and 1970
| Year | Interpolated | Actual |
|---|---|---|
| 1954 | 2.71 | 2.72 |
| 1960 | 3.06 | 3.03 |
| 1966 | 3.41 | 3.41 |
The interpolated population values are quite close to the actual population values, meaning the model (here the line fit to the data) is doing well. Now, let’s extrapolate to dates outside the fit range, as shown in Table 1-3.
Table 1-3: Extrapolating the Population After 1970
| Year | Extrapolated | Actual |
|---|---|---|
| 1995 | 5.10 | 5.74 |
| 2010 | 5.98 | 6.96 |
| 2020 | 6.56 | 7.79 |
The difference between the extrapolated population values and the actual population is increasing with each year. The model isn’t doing well. Plotting the full range from 1950 through 2020 reveals the problem; see Figure 1-8.
Figure 1-8: World population from 1950 through 2020
As time goes by, the fit line becomes increasingly wrong because the data is not linear after all. That is, the rate of growth is not constant and doesn’t follow a straight line.
When extrapolating, we might have reason to believe that the data will continue to fit the line; if that’s a valid assumption, then the line will continue to be a good fit. However, in the real world, we usually have no such assurance. So, as a slogan, we might say interpolation good, extrapolation bad.
Fitting a line to some data is an example of curve fitting. What is true for curve fitting is also true for AI. The handwritten digits model did well when given inputs close to the data it was trained to recognize. The digits in the test data were all instances of 0, 1, 3, and 9, so the test data was like the training data. The two datasets are from the same distribution, and the same data-generating process created both. We can therefore claim that the model was, in a way, interpolating in those cases. However, when we forced the model to make decisions about fours and sevens, we were extrapolating by having the model make decisions about data it never saw during training.
It bears repeating: interpolation good, extrapolation bad. Bad datasets lead to bad models; good datasets lead to good models, which behave badly when forced to extrapolate. And, for good measure: all models are wrong, but some are useful.
Along the same lines of Hilaire Belloc’s 1907 book Cautionary Tales for Children—an amusing and somewhat horrifying look at foolish things children do that could lead to an unfortunate end—let’s examine some cautionary tales that AI practitioners should be aware of when training, testing, and, most of all, deploying models.
In 2016, I attended a conference talk where the presenter demonstrated research into understanding why a neural network chooses the way it does. This is not yet a solved problem, but progress has been made. In this case, the research marked parts of the input images that influenced the model’s decision.
The speaker displayed pictures of huskies and wolves and discussed his classifier for differentiating between the two. He showed how well it performed on a test set and asked the audience of machine learning researchers if this was a good model. Many people said yes, but with hesitation because they expected a trap. They were right to be hesitant. The speaker then marked the images to show the parts that the neural network focused on when making its decisions. The model wasn’t paying attention to the dogs or the wolves. Instead, the model noticed that all the wolf training images had snow in the background, while none of the dog images contained snow. The model learned nothing about dogs and wolves but only about snow and no snow. Careless acceptance of the model’s behavior wouldn’t have revealed that fact, and the model might have been deployed only to fail in the wild.
A similar tale is told of a very early machine learning system from the 1950s or 1960s. This one is likely apocryphal, though I have read a paper from that period that might be the origin of the urban legend. In this case, the images were bird’s-eye views of forests. Some images contained a tank, while others did not.
A model trained to detect tanks seemed to work well on the training data but failed miserably when set loose in the wild. It was eventually realized that one set of training images had been taken on a sunny day and the other on a cloudy day. The model had learned nothing that its creators assumed it had.
More recent examples of this phenomenon exist with more advanced machine learning models. Some have even fooled experts into believing the model had learned something fundamental about language or the like when, instead, it had learned extremely subtle correlations in the training data that no human could (easily) detect.
The word correlation has a strict mathematical meaning, but we capture its essence with the phrase “correlation does not imply causation.” Correlation is when two things are linked so that the occurrence of one implies the occurrence of the other, often in a particular order. More concretely, correlation measures how strongly a change in one thing is associated with a change in another. If both increase, they are positively correlated. If one increases while the other decreases, they are negatively correlated.
For example, a rooster crows, and the sun comes up. The two events are time-dependent: the rooster first, then the sun. This correlation does not imply causation, as the rooster crowing doesn’t cause the sun to rise, but if such a correlation is observed often enough, the human mind begins to see one as causing the other, even when there is no real evidence of this. Why humans act this way isn’t hard to understand. Evolution favored early humans who made such associations because, sometimes, the associations led to behavior beneficial for survival.
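In code, measuring correlation is a single call. The two series below are fabricated to show the perfectly positive and perfectly negative cases.

```python
import numpy as np

# Correlation ranges from 1 (perfect positive) through 0 (none)
# to -1 (perfect negative). Made-up series for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1   # rises as x rises: positively correlated
y_neg = 10 - 3 * x  # falls as x rises: negatively correlated

# np.corrcoef returns a 2x2 matrix; [0, 1] is the correlation
# between the two inputs.
print(round(float(np.corrcoef(x, y_pos)[0, 1]), 3))  # 1.0
print(round(float(np.corrcoef(x, y_neg)[0, 1]), 3))  # -1.0
```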
“Correlation does not imply causation” also applies to AI. The aforementioned models learned to detect things in the training data that correlated with the intended targets (dogs, wolves, tanks) but didn’t learn about the targets themselves. Savvy machine learning practitioners are always on the lookout for such spurious correlations. Using a large and highly diverse dataset for training and testing can defend against this effect, though this isn’t always possible in practice.
We must ask whether our models have learned what we assume they have. And, as we saw with the MNIST digits, we must ensure that our models have seen all the kinds of inputs they will encounter in the wild—they should interpolate, not extrapolate.
This matters more than it might initially appear. Google learned this lesson in 2015 when it deployed a feature for Google Photos, wherein the model was insufficiently trained on human faces and made incorrect and inappropriate associations. Bias, in both the generic and social senses, is a real issue in AI.
让我们用 MNIST 数字再做一个实验。这一次,模型要做一个看似简单的判断:输入的数字是不是 9?模型与之前使用的神经网络相同。如果训练数据集中的每张图像要么是 9,要么是除 4 和 7 之外的其他数字(即训练数据中没有 4 和 7),那么模型的准确率为 99%,如下面的混淆矩阵所示:
Let’s perform another experiment with MNIST digits. This time, the model has a seemingly simple decision to make: is the input digit a nine? The model is the same neural network used previously. If trained on a dataset where every image is either a nine or any other digit except four or seven (that is, no fours or sevens are in the training data), then the model is 99 percent accurate, as the confusion matrix shows:
| | 不是 9 Not 9 | 9 |
|---|---|---|
| 不是 9 Not 9 | 9,754 | 23 |
| 9 | 38 | 1,362 |
混淆矩阵告诉我们,在 9,777 张不是 9 的测试图像中,模型正确标记了 9,754 张。在 1,400 个 9 中,模型的标签对其中 1,362 个也是正确的。虽然模型在测试集上表现良好,但该测试集不包含 4 和 7。
The confusion matrix tells us that the model correctly labeled 9,754 out of 9,777 test images that were not a nine. The model’s label was also correct for 1,362 of the 1,400 nines. While the model performs well on the test set, the set does not contain fours or sevens.
在这种情况下,混淆矩阵很小,因为模型只有两个类:九个或非九个。换句话说,这是一个二元模型。
In this case, the confusion matrix is small because the model has only two classes: nine or not nine. In other words, this is a binary model.
矩阵右上角的 23 代表输入不是 9 但模型表示是 9 的 23 次。对于二元模型,类别 1 通常被视为感兴趣的类别或正类。因此,这 23 个输入代表误报,因为模型说“这是 9”,但事实并非如此。同样,左下角的 38 个样本也是假阴性,因为模型说“这不是 9”,而输入实际上是 9。我们希望模型没有误报或漏报,但有时最小化其中一个比另一个更重要。
The 23 in the upper-right corner of the matrix represents 23 times when the input wasn’t a nine, but the model said it was. For a binary model, class 1 is usually considered the class of interest, or the positive class. Therefore, these 23 inputs represent false positives, because the model said “it’s a nine” when it wasn’t. Similarly, the 38 samples at the lower left are false negatives because the model said “it’s not a nine” when the input actually was. We want models with no false positives or negatives, but sometimes it’s more important to minimize one than the other.
例如,如果模型要在乳房 X 光检查中检测乳腺癌,则假阳性代表模型说“这可能是癌症”的情况,尽管事实并非如此。这听起来很可怕,但进一步的测试将表明该模型是错误的。然而,假阴性代表漏诊了癌症。如果模型几乎没有假阴性,我们可能会容忍具有更多假阳性的模型,因为假阳性比假阴性的灾难性要小。我们开始意识到充分训练、表征、测试和理解我们的机器学习模型是多么重要。
For example, if a model is to detect breast cancer in mammograms, a false positive represents a case where the model says, “it might be cancer,” even though it isn’t. That’s scary to hear, but further testing will show that the model was wrong. However, a false negative represents a missed cancer. We might tolerate a model with more false positives if it also has virtually no false negatives, as a false positive is less catastrophic than a false negative. We’re beginning to appreciate how important it is to fully train, characterize, test, and understand our machine learning models.
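为了让这些定义更具体,下面的草图用上文混淆矩阵中的数字计算准确率、精确率和召回率(精确率和召回率是标准术语,本书此处并未正式引入;这只是示意,并非书中的代码):

To make these definitions concrete, the following sketch computes accuracy, precision, and recall from the numbers in the confusion matrix above (precision and recall are standard terms not formally introduced here; this is an illustrative sketch, not the book’s code):

```python
# Counts from the "is it a nine?" confusion matrix above.
tn, fp = 9_754, 23    # true negatives, false positives (row "not 9")
fn, tp = 38, 1_362    # false negatives, true positives (row "9")

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # fraction of all labels that are correct
precision = tp / (tp + fp)      # when the model says "nine," how often is it right?
recall = tp / (tp + fn)         # what fraction of the real nines did it find?

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# → accuracy=0.995 precision=0.983 recall=0.973
```

对于乳腺癌筛查这样的场景,我们最关心的就是高召回率(少漏报),即使这意味着精确率(多误报)有所下降。For a screening task like mammography, recall (few false negatives) matters most, even at some cost in precision (more false positives).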
****
****
好吧,回到我们的实验。这个“是不是 9”分类器和我们之前的 MNIST 模型一样,对 4 和 7 一无所知。当看到 4 和 7 时,MNIST 模型通常把它们叫作 9。这个模型也会这样吗?以下是我向模型提供 4 和 7 时得到的结果:
All right, back to our experiment. The “is it a nine” classifier, like our earlier MNIST model, knows nothing about fours or sevens. When shown fours and sevens, the MNIST model typically called them nines. Will this model do the same? Here’s what I received when I gave the model fours and sevens:
| | 不是 9 Not 9 | 9 |
|---|---|---|
| 不是 9 Not 9 | 5,014 | 9,103 |
该模型将 14,117 个 4 和 7 中的 9,103 个标记为 9,接近 65%,也就是大约每三个中有两个。这对应于我们向模型提供其从未被训练检测的输入类型的情况。
The model marked 9,103 of the 14,117 fours and sevens as nines. That’s nearly 65 percent, or roughly two out of every three. This mimics the case where we present the model with inputs of a type it was never trained to detect.
让我们通过向训练集中添加 4 和 7 来帮助模型。希望提供“看起来像 9,但其实不是”的例子(正式名称为“硬负例”,hard negatives)能够改进模型。我让训练数据中 3% 是 4 和 7。整个模型的准确率和以前一样是 99%,下面是当我向它展示从未见过的 4 和 7 时发生的情况:
Let’s help the model by adding fours and sevens to the training set. Hopefully, providing examples that say, “It looks like a nine, but it isn’t,” formally known as hard negatives, will improve the model. I made 3 percent of the training data fours and sevens. The overall model was just as accurate as before, 99 percent, and here’s what happened when I gave it fours and sevens it had never seen before:
| | 不是 9 Not 9 | 9 |
|---|---|---|
| 不是 9 Not 9 | 9,385 | 3,321 |
这样好多了。模型不再把三分之二的 4 和 7 叫作 9,而是只把大约四分之一标记为 9。即使只是少量“看起来像正类但其实不是”的例子也能有所帮助。如果我把训练集中 4 和 7 的比例提高到 18%,模型对 4 和 7 的错误分类率就会低于 1%。由于模型从数据中学习,我们必须使用尽可能完整的数据集,让模型进行插值而不是外推。
That’s better. Instead of calling two-thirds of four or seven inputs a nine, the model labeled only one in four as a nine. Even a few examples of things that look like the positive class but aren’t can help. If I boost the proportion of fours and sevens in the training set to 18 percent, the model misclassifies fours and sevens less than 1 percent of the time. Because models learn from data, we must use datasets that are as complete as possible so our models interpolate and do not extrapolate.
注
NOTE
严格来讲,最近的研究表明,现代深度学习模型几乎总是在外推;但输入与模型训练时的数据越相似,性能就越好,所以我觉得使用这个类比是合理的。
To be completely accurate, recent research shows that modern deep learning models are almost always extrapolating, but the more similar the inputs are to the data on which the model was trained, the better the performance, so I feel justified in using the analogy.
每个想要了解人工智能(更不用说使用人工智能)的人都必须牢记有关用于训练人工智能模型的数据质量的警告。Michael Roberts 等人于2021 年在《Nature Machine Intelligence》杂志上发表的一篇研究文章“使用胸部 X 线照片和 CT 扫描使用机器学习检测和预测 COVID-19 的常见陷阱和建议”就是一个发人深省的例子。作者评估了旨在通过胸部 X 光和 CT 扫描检测 COVID-19 的机器学习模型的性能,将 2,000 多项研究(模型)的初始候选池减少到 62 个以进行严格测试。最后,作者宣称由于构造缺陷、数据集偏差或两者兼而有之,没有一个模型适合临床使用。
Everyone who seeks to understand, let alone work with, AI must take the warnings about the quality of the data used to train AI models to heart. A 2021 research article published in the journal Nature Machine Intelligence by Michael Roberts et al., “Common Pitfalls and Recommendations for Using Machine Learning to Detect and Prognosticate for COVID-19 Using Chest Radiographs and CT Scans,” is a sobering example. The authors assessed the performance of machine learning models designed to detect COVID-19 in chest X-rays and CT scans, reducing the initial candidate pool of over 2,000 studies (models) to 62 for rigorous testing. In the end, the authors declared none of the models fit for clinical use because of flaws in construction, bias in the datasets, or both.
诸如此类的结果催生了可解释人工智能的诞生,这是一个旨在赋予模型自我解释能力的子领域。
Results like these have led to the creation of explainable AI, a subfield that seeks to give models the ability to explain themselves.
查看您的数据并尝试尽可能了解您的模型正在做什么以及为什么。
Look at your data and try to understand, as far as humanly possible, what your model is doing and why.
****
****
本章的标题“我们走吧”(And Away We Go)是喜剧演员杰基·格里森的口头禅。先快速扎进一个主题获得概览,之后再回过头来更深入地理解,这通常是好的做法。换句话说,我们先匆匆上阵,对主题有个感觉,然后再更系统地探索。
This chapter’s title, “And Away We Go,” was comedian Jackie Gleason’s tagline. It’s often good to dive into a subject to get an overview before coming back to understand things at a deeper level. In other words, we rush in to get a feel for the topic before exploring more methodically.
您将在本书末尾的术语表中找到本章中介绍的许多新术语和概念。我的目标不是让你现在就理解它们,更不用说保留它们,而是播下种子,以便下次你遇到这些术语或概念之一时,你更有可能会想:“啊,我知道那个。”后面的章节会强化它们,您将通过反复接触了解重要的内容。
You’ll find the many new terms and concepts introduced in this chapter in the glossary at the end of the book. My goal isn’t for you to understand them all now, let alone retain them, but to plant seeds so that the next time you encounter one of these terms or concepts, you’ll be more likely to think, “Ah, I know that one.” Later chapters reinforce them, and you’ll learn the important ones via repeated exposure.
本章的要点有两类。第一个与人工智能是什么及其基本组成部分有关。第二个是关于建立关于人工智能提供什么以及我们应该如何应对的直觉。
There are two categories of takeaways from this chapter. The first has to do with what AI is and its essential pieces. The second is about building intuition about what AI offers and how we should respond.
人工智能涉及模型,即目前还很模糊的实体,我们可以用数据对其进行调节,使其执行某种所需的任务。人工智能模型有很多种,本章介绍了两种:决策树和神经网络。关于决策树我不会再多说什么,但神经网络将占据本书其余部分的大部分篇幅。
AI involves models, as yet nebulous entities we can condition with data to perform some desired task. There are many types of AI models, and this chapter introduced two: decision trees and neural networks. I won’t say much more about decision trees, but neural networks occupy most of the remainder of the book.
模型通常最好被视为函数,就像您在学校中记住的数学函数或构成大多数计算机程序核心的函数一样。两者都可以被视为黑匣子,其中有东西进去(输入),有东西出来(输出)。在人工智能中,输入是一个特征向量,是适合当前任务的任何内容的集合。在本章中,我们使用了两个特征向量:花朵的测量值和手写数字的图像。
Models are often best thought of as functions, like the mathematical functions you may remember from school or the functions that form the core of most computer programs. Both can be considered black boxes, where something goes in (the input) and something comes out (the output). In AI, the input is a feature vector, a collection of whatever is appropriate for the task at hand. In this chapter, we used two feature vectors: measurements of a flower and images of a handwritten digit.
训练通过改变模型参数来调整模型,使其尽可能准确。在训练大多数模型时,有必要谨慎行事,以了解数据的一般特征,而不是虚假相关性或训练集的微小细节(称为过度拟合的概念,我们将在第 4 章中讨论)。
Training conditions the model by altering its parameters to make it as accurate as possible. It’s necessary to exercise caution when training most models to learn the general features of the data and not spurious correlations or the minute details of the training set (a concept known as overfitting, which we’ll discuss in Chapter 4).
机器学习模型的正确开发意味着我们必须有一个测试集,即训练时不使用的已知输入和输出对的集合。我们在训练后使用这个集合来评估模型。如果数据集构建正确,测试集可以让我们了解模型在野外的表现如何。
Proper development of machine learning models means we must have a test set, a collection of known input and output pairs that we do not use when training. We use this set after training to evaluate the model. If the dataset is constructed correctly, the test set provides an idea of how well we can expect the model to perform in the wild.
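保留测试集的做法可以用几行代码勾勒出来(这是一个示意性草图,并非书中的实现;函数名和玩具数据都是为说明而虚构的):

Holding out a test set can be sketched in a few lines (an illustrative sketch, not the book’s implementation; the function name and toy data are made up for illustration):

```python
import random

def train_test_split(pairs, test_fraction=0.2, seed=42):
    """Shuffle (input, label) pairs and hold out a fraction for testing."""
    pairs = list(pairs)                      # copy so shuffling is safe
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]    # (train, test)

data = [((x,), x % 2) for x in range(100)]   # 100 toy (input, label) pairs
train, test = train_test_split(data)
print(len(train), len(test))                 # → 80 20
```

关键在于:test 中的样本在训练时绝不使用,只在训练结束后用于评估。The key point is that the samples in `test` are never used during training; they serve only for evaluation afterward.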
第二个要点涉及人工智能提供什么,以及我们应该如何应对它。虽然人工智能很强大,但它并不像我们那样思考(尽管第 7 章的模型可能不同意)。人工智能的生死取决于数据,它的好坏取决于我们提供给它的数据。如果数据集有偏差,那么人工智能也会有偏差。如果数据集遗漏了模型在实际使用中会遇到的输入类型的示例,人工智能就无法正确处理这类输入。
The second takeaway relates to what AI offers and how we should respond to it. While AI is powerful, it doesn’t think as we do (though the models of Chapter 7 might disagree). AI lives and dies by data and is only as good as the data we feed to it. If the dataset is biased, the AI is biased. If the dataset neglects to include examples of the types of inputs it will encounter when used, the AI will fail to handle such inputs properly.
本章的例子警告我们在假设人工智能按预期运行时要小心。该模型是否学到了我们希望它学到的东西?是否受到我们没有注意到的数据相关性的影响,或者更糟糕的是,我们的能力太有限而无法辨别?回想一下哈士奇与狼的例子。
The chapter’s examples warn us to be careful when assuming AI operates as intended. Did the model learn what we wanted it to learn? Was it influenced by correlations in the data that we didn’t notice or, worse still, that we are too limited to discern? Think back to the huskies versus wolves example.
因为人工智能的好坏取决于输入的数据,所以我们有责任使数据集公平、公正,并在没有假设的情况下了解人工智能真正学到了什么。
Because AI is only as good as the data fed to it, it’s on us to make datasets fair and unbiased and to understand what the AI has truly learned without assumptions.
人工智能首次出现于 20 世纪 50 年代,那么为什么现在它突然无处不在呢?下一章将回答这个问题。
AI first appeared in the 1950s, so why is it now suddenly everywhere we look? The next chapter answers this question.
关键术语
KEY TERMS
算法、人工智能、分类器、类标签、混淆矩阵、数据集、决策树、深度学习、可解释的人工智能、特征、特征向量、机器学习、模型、神经网络、参数、测试、训练
algorithm, artificial intelligence, classifier, class label, confusion matrix, dataset, decision tree, deep learning, explainable AI, feature, feature vector, machine learning, model, neural network, parameters, testing, training
罗温·阿特金森的喜剧杰作《憨豆先生》开场于夜深人静、空无一人的伦敦街头。一束聚光灯亮起,主角从天而降,唱诗班用拉丁语唱着“ecce homo qui est faba”——看哪,这个人是一颗豆子。憨豆先生爬起身来,掸掸西装,笨拙地跑进黑暗中。他是某种超凡脱俗的存在,一个真正从天而降、令人难以理解的东西。
Rowan Atkinson’s comic masterpiece Mr. Bean opens in the dead of night on a deserted London street. A spotlight appears, the title character falls from the sky, and a choir sings in Latin, “ecce homo qui est faba”—behold the man who is a bean. Mr. Bean picks himself up, brushes off his suit, and runs awkwardly into the darkness. He is something otherworldly, a thing that literally fell from the sky, defying comprehension.
考虑到近年来人工智能奇迹不断出现,我们可能会认为人工智能就像憨豆先生一样,从天而降,完全成型,超出了我们的理解范围。然而,这一切都不是真的。事实上,我认为人工智能仍处于起步阶段。
Given the parade of AI wonder after wonder in recent years, we might be excused for thinking that AI, like Mr. Bean, fell from the sky, fully formed and beyond our comprehension. However, none of this is true; indeed, I’d argue that AI is still in its infancy.
那么为什么我们现在听到人工智能呢?我将通过人工智能的简短(且有偏见的)历史来回答这个问题,然后讨论作为人工智能革命催化剂的计算进步。本章为我们将在本书的其余部分中探索的模型提供了背景。
So why are we hearing about AI now? I’ll answer that question with a brief (and biased) history of AI, followed by a discussion of the advances in computing that acted as the catalyst for the AI revolution. This chapter provides context for the models we’ll explore throughout the remainder of the book.
****
****
自诞生以来,人工智能被分为两大阵营:符号人工智能和联结主义。符号人工智能试图通过操纵符号和逻辑陈述或关联来模拟智能。然而,联结主义试图通过构建更简单组件的网络来模拟智能。人类的思维体现了这两种方法。我们使用符号作为思想和语言的元素,我们的思维是由极其复杂的神经元网络构成的,每个神经元都是一个简单的处理器。在计算机编程术语中,人工智能的符号方法是自上而下的,而联结主义是自下而上的。自上而下的设计从高级任务开始,然后将这些任务分解为越来越小的部分。自下而上的设计从较小的部分开始,然后将它们组合在一起。
Since its inception, AI has been divided into two main camps: symbolic AI and connectionism. Symbolic AI attempts to model intelligence by manipulating symbols and logical statements or associations. Connectionism, however, attempts to model intelligence by building networks of simpler components. The human mind embodies both approaches. We use symbols as elements of thought and language, and our minds are constructed from unbelievably complex networks of neurons, each neuron a simple processor. In computer programming terms, the symbolic approach to AI is top-down, while connectionism is bottom-up. Top-down design starts with high-level tasks, then breaks those tasks into smaller and smaller pieces. A bottom-up design begins with smaller pieces and combines them together.
符号人工智能的支持者认为,智能可以在抽象中实现,而不需要像大脑那样的底层。联结主义者遵循大脑的进化发展,并认为需要有一些基础,比如大量高度互连的神经元的集合,智能(无论如何定义)才能从中产生。
Proponents of symbolic AI believe that intelligence can be achieved in the abstract, without a substrate resembling a brain. Connectionists follow the evolutionary development of brains and argue that there needs to be some foundation, like a massive collection of highly interconnected neurons, from which intelligence (however defined) can emerge.
虽然符号人工智能与联结主义之间的争论由来已久,但随着深度学习的出现,可以肯定地说联结主义赢得了这一役——尽管也许还没有赢得整场战争。近年来出现了一些将两种方法融合的论文。我怀疑符号人工智能即便最终不会以配角身份登场,也至少还会有一两次客串。
While the debate between symbolic AI and connectionism was long-lived, with the advent of deep learning it’s safe to say that the connectionists have won the day—though perhaps not the war. Recent years have seen a smattering of papers blending the two approaches. I suspect symbolic AI has a cameo or two left in it, if not ultimately starring in a supporting role.
我在 20 世纪 80 年代末接触人工智能时,它完全是符号式的。联结主义被作为另一种方法提及,但当时人们认为神经网络较差,充其量只有一点边际用处。
My introduction to AI in the late 1980s was entirely symbolic. Connectionism was mentioned as another approach, but neural networks were thought inferior and likely to be marginally useful at best.
人工智能的完整历史超出了我们的范围。这样一部巨著正等待着一位有动力、有能力的历史学家。相反,我将专注于机器学习的发展,同时(非常不公平!)忽略符号阵营中的人们几十年来付出的巨大努力。然而,要知道,在人工智能历史的大部分时间里,人们谈论的大多是符号人工智能,而不是联结主义。为了更公平地进行演示,我推荐迈克尔·伍德里奇(Michael Wooldridge)的书《人工智能简史》(Flatiron Books,2021 年),或帕梅拉·麦考达克(Pamela McCorduck)在《这可能很重要:我与人工智能的生活和时代》(Lulu Press,2019 年)中的深度个人叙述。
A complete history of artificial intelligence is beyond our scope. Such a magnum opus awaits a motivated and capable historian. Instead, I’ll focus on the development of machine learning while (very unfairly!) ignoring the mountain of effort expended over the decades by those in the symbolic camp. Know, however, that for most of AI’s history, people mostly spoke of symbolic AI, not connectionism. For a fairer presentation, I recommend Michael Wooldridge’s book A Brief History of Artificial Intelligence (Flatiron Books, 2021), or Pamela McCorduck’s deeply personal account in This Could Be Important: My Life and Times with the Artificial Intelligentsia (Lulu Press, 2019).
考虑到我明显的联结主义偏见,让我们回顾一下机器学习的历史。
With my apparent connectionist bias in mind, let’s take a stroll through the history of machine learning.
智能机器的梦想可以追溯到古代。古希腊人讲述了塔洛斯的神话,这是一个守护腓尼基公主欧罗巴的巨型机器人。在整个中世纪和文艺复兴时期,人们制造了许多自动机——能够运动且看起来栩栩如生的机器。不过,我怀疑当时没有人认为它们是智能的或能够思考的。有些甚至是骗局,比如臭名昭著的“机械土耳其人”,它通过与许多熟练的国际象棋棋手对弈并获胜而让世界惊叹。最终,人们发现藏在机器里的人可以从下方观察棋盘布局,并通过操纵机械臂移动棋盘上独立的棋子来控制这台“自动机”。尽管如此,这台机器的机械部分对于 18 世纪末来说还是相当令人印象深刻的。
The dream of intelligent machines dates back to antiquity. Ancient Greeks related the myth of Talos, a giant robot meant to guard the Phoenician princess, Europa. Throughout the Middle Ages and Renaissance, many automatons—machines that moved and appeared lifelike—were developed. However, I suspect that none were believed to be intelligent or capable of thought. Some were even hoaxes, like the infamous Mechanical Turk that wowed the world by playing, and beating, many skilled chess players. In the end, it was discovered that a person hiding within the machine could control the “automaton” by manipulating a mechanical arm to move free-standing chess pieces on the board while viewing the board configuration from beneath. Still, the mechanical part of the machine was rather impressive for the late 18th century.
除了自动机之外,人们还早期尝试将思想理解为一个机械过程,并努力产生一个能够捕捉思想的逻辑系统。 17世纪,戈特弗里德·莱布尼茨(Gottfried Leibniz)将这样一个概念抽象地描述为“思想字母表”。 1750年代,Julien Offray de La Mettrie出版了《L'Homme Machine》(人作为机器),认为思想是一个机械过程。
Apart from automatons, there were also early attempts to understand thought as a mechanical process and efforts to produce a logical system capable of capturing thought. In the 17th century, Gottfried Leibniz described such a concept abstractly as an “alphabet of thought.” In the 1750s, Julien Offray de La Mettrie published L’Homme Machine (Man as Machine), arguing that thought is a mechanical process.
人类思想可能源自大脑的物理实体而不是精神灵魂的想法标志着人工智能之路新篇章的开始。如果我们的思想是生物机器,为什么不能有另一种能够思考的机器呢?
The idea that human thought might emerge from the physical entity of the brain rather than the spiritual soul marked the beginning of a new chapter on the road to AI. If our minds are biological machines, why can’t there be another kind of machine that thinks?
19 世纪,乔治·布尔试图创建一种思想演算,其成果就是我们现在所知的布尔代数。计算机依赖布尔代数,甚至可以说布尔代数就体现在计算机作为数字逻辑门集合的实现之中。布尔取得了部分成功,但他没有实现他既定的目标:“研究进行推理的思维运作的基本规律;用微积分的符号语言表达它们”(《思维法则》,1854)。布尔愿意去尝试,这本身就代表着“人工智能或许可能”这一观念又向前迈进了一步。
In the 19th century, George Boole attempted to create a calculus of thought, resulting in what we know now as Boolean algebra. Computers depend on Boolean algebra, to the point that it represents their very implementation as collections of digital logic gates. Boole was partially successful, but he didn’t achieve his stated goal: “to investigate the fundamental laws of those operations of the mind by which reasoning is performed; to give expression to them in the symbolic language of a Calculus” (The Laws of Thought, 1854). That Boole was willing to try represented another step toward the notion that AI might be possible.
这些早期的尝试所缺乏的是一台真正的计算机器。人们可以梦想人造思想或生物(如玛丽·雪莱的《弗兰肯斯坦》中的生物),并假设它们的存在,讨论其影响。但在出现一台能够合理地模仿(实现?)思想的机器之前,其他一切都只是猜测。
What these early attempts were lacking was an actual calculating machine. People could dream of artificial minds or beings (like the creature from Mary Shelley’s Frankenstein) and, assuming their existence, discuss the repercussions. But until there was a machine capable of plausibly mimicking (implementing?) thought, all else was speculation.
英国人查尔斯·巴贝奇(Charles Babbage)在 19 世纪中叶首先构想出一种可实现的通用计算机器:分析机。分析机从未被完整建造出来,但它包含现代计算机的所有基本组件,并且在理论上能够执行相同的操作。虽然尚不清楚巴贝奇是否意识到他的机器潜在的多功能性,但他的朋友艾达·洛夫莱斯意识到了。她在文章中把这台机器描述为一种适用范围广泛的通用设备。尽管如此,她并不相信分析机能够思考,正如她的《分析机概述》(Sketch of the Analytical Engine,1843)中的这段话所表明的:
It was Englishman Charles Babbage who, in the mid-19th century, first conceived of an implementable general-purpose calculating machine: the Analytical Engine. The Engine was never built in its entirety, but it contained all the essential components of a modern computer and would, in theory, be capable of the same operations. While it’s unclear if Babbage appreciated the potential versatility of his machine, his friend, Ada Lovelace, did. She wrote about the machine as a widely applicable, general-purpose device. Still, she did not believe the Engine was capable of thought, as this quote from her Sketch of the Analytical Engine (1843) demonstrates:
分析引擎没有任何意图来创造任何东西。它可以做任何我们知道如何命令它执行的事情。它可以遵循分析;但它没有能力预测任何分析关系或真理。它的职责是帮助我们提供我们已经熟悉的东西。
The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with.
这段话也许是第一次在谈到一台或许有能力实现人工智能的设备时,提及人工智能的可能性。“做任何我们知道如何命令它执行的事情”这句话意味着编程。事实上,洛夫莱斯为分析机编写了一个程序。因此,许多人认为她是第一位计算机程序员。她的程序中有一个 bug,这一事实在我看来恰恰证明了这一点;没有什么比 bug 更能代表编程了,我 40 多年的编程经验常常令人痛苦地证明这一点。
This quote may be the first to refer to the possibility of artificial intelligence involving a device potentially capable of achieving it. The phrase “do whatever we know how to order it to perform” implies programming. Indeed, Lovelace wrote a program for the Analytical Engine. Because of this, many people consider her to be the first computer programmer. The fact that her program had a bug in it proves to me that she was; nothing is more emblematic of programming than bugs, as my 40-plus years of programming experience have demonstrated distressingly often.
1936 年,24 岁的英国人艾伦·图灵(当时还是一名学生)写了一篇论文,该论文后来成为计算机科学的基石。在这篇论文中,图灵介绍了一种通用概念机,我们现在称之为图灵机,并证明它可以计算任何可以用算法表示的东西。他还解释说,有些事情无法通过算法实现,因此是不可计算的。由于所有现代编程语言都相当于图灵机,因此现代计算机可以实现任何算法并计算任何可计算的东西。然而,这并没有说明计算可能需要多长时间或所需的内存。
In 1936, a 24-year-old Englishman named Alan Turing, still a student at the time, wrote a paper that has since become the cornerstone of computer science. In this paper, Turing introduced a generic conceptual machine, what we now call a Turing machine, and demonstrated that it could calculate anything representable by an algorithm. He also explained that there are things that cannot be implemented by algorithms and that are, therefore, uncomputable. Since all modern programming languages are equivalent to a Turing machine, modern computers can implement any algorithm and compute anything computable. However, this says nothing about how long the computation might take or the memory required.
如果计算机可以计算任何可以作为算法实现的东西,那么计算机就可以执行人类可以执行的任何心理操作。最后,这就是可能实现真正人工智能的引擎。图灵 1950 年发表的论文《计算机器与智能》是对数字计算机最终可能导致智能机器的早期认识。在这篇论文中,图灵描述了他的“模仿游戏”,现在被称为图灵测试,通过它,人类可能会相信机器是智能的。特别是近年来,出现了许多关于人工智能系统通过图灵测试的说法。其中之一是 OpenAI 的 ChatGPT。然而,很少有人愿意相信 ChatGPT 是真正智能的——换句话说,我怀疑这个测试未能捕捉到人类通常理解这个术语的含义,并且可能会在某个时候创建一个新的测试。
If a computer can compute anything that can be implemented as an algorithm, then a computer can perform any mental operation a human can perform. At last, here was the engine that might enable true artificial intelligence. Turing’s 1950 paper “Computing Machinery and Intelligence” was an early recognition that digital computers might eventually lead to intelligent machines. In this paper, Turing described his “imitation game,” known now as the Turing test, by which humans might come to believe that a machine is intelligent. Many claims of AI systems that pass the Turing test have appeared, especially in recent years. One of these is OpenAI’s ChatGPT. However, few would be inclined to believe that ChatGPT is truly intelligent—in other words, I suspect that this test fails to capture what humans generally understand this term to mean, and a new test will likely be created at some point.
1943 年,沃伦·麦卡洛克 (Warren McCulloch) 和沃尔特·皮茨 (Walter Pitts) 撰写了《神经活动中内在思想的逻辑演算》(A Logical Calculus of Ideas Immanent in Nervous Activity),该书被誉为有史以来最不透明但最有趣的论文标题之一,值得获奖。该论文将“神经网络”(神经元的集合)表示为数学中的逻辑陈述。这些逻辑陈述很难解析(至少对我来说),但作者对“没有圆圈的网络”的描述与我们将在第 4 章中探讨的神经网络非常相似——事实上,人们可能会认为麦卡洛克和皮茨的开创性的论文催生了我们现在所认识的神经网络。坦率地说,神经网络更容易解析和理解,这对我们来说是个好消息。
In 1943, Warren McCulloch and Walter Pitts wrote “A Logical Calculus of Ideas Immanent in Nervous Activity,” which deserves an award for one of the most opaque yet intriguing paper titles ever. The paper represents “nervous nets” (collections of neurons) as logical statements in mathematics. The logical statements are difficult to parse (at least for me), but the authors’ description of “nets without circles” bears a strong resemblance to the neural networks we’ll explore in Chapter 4—indeed, one could argue that McCulloch and Pitts’s groundbreaking paper led to what we now recognize as a neural network. Frankly, neural networks are far easier to parse and understand, which is good news for us.
从关于人工智能机器和生物的奇幻故事,到对数学是否能够捕捉思想和推理的认真调查,再加上数字计算机能够计算任何可以用算法描述的东西的认识,为人工智能的出现奠定了基础人工智能作为合法的研究企业。
The progression from fantastical stories about artificially intelligent machines and beings to a serious investigation of whether mathematics can capture thought and reasoning, combined with the realization that digital computers are capable of computing anything that can be described by an algorithm, set the stage for the advent of artificial intelligence as a legitimate research enterprise.
1956年达特茅斯人工智能夏季研究项目研讨会被普遍认为是人工智能的诞生地,也是“人工智能”一词首次被一致使用的地方。达特茅斯研讨会的参与者不到 50 人,但名单中包括计算机科学和数学领域的几位知名人士:雷·所罗门诺夫 (Ray Solomonoff)、约翰·麦卡锡 (John McCarthy)、马文·明斯基 (Marvin Minsky)、克劳德·香农 (Claude Shannon)、约翰·纳什 (John Nash) 和沃伦·麦卡洛克 (Warren McCulloch) 等。当时,计算机科学是数学的一个子领域。该研讨会是一次集思广益的会议,为早期人工智能研究奠定了基础。
The 1956 Dartmouth Summer Research Project on Artificial Intelligence workshop is generally regarded as the birthplace of AI, and where the phrase “artificial intelligence” was first used consistently. The Dartmouth workshop had fewer than 50 participants, but the list included several well-known names in the worlds of computer science and mathematics: Ray Solomonoff, John McCarthy, Marvin Minsky, Claude Shannon, John Nash, and Warren McCulloch, among others. At the time, computer science was a subfield of mathematics. The workshop was a brainstorming session that set the stage for early AI research.
1957 年,康奈尔大学的 Frank Rosenblatt 创建了 Mark I 感知器,它被广泛认为是神经网络的第一个应用。感知器在许多方面都很了不起,其中之一是它是为图像识别而设计的,而深度学习正是在 2012 年首次在同一应用上证明了自己。
In 1957, Frank Rosenblatt of Cornell University created the Mark I Perceptron, widely recognized as the first application of neural networks. The Perceptron was remarkable in many respects, including that it was designed for image recognition, the same application where deep learning first proved itself in 2012.
图 2-1显示了感知器操作员手册中给出的概念组织。感知器使用 20×20 像素的数字化电视图像作为输入,然后通过一组“随机”连接传递到一组关联单元,从而产生响应单元。这种配置类似于当今使用的一些图像深度学习方法,并且类似于一种称为极限学习机的神经网络。
Figure 2-1 shows the conceptual organization as given in the Perceptron Operators’ Manual. The Perceptron used a 20×20-pixel digitized television image as input, which was then passed through a “random” set of connections to a set of association units that led to response units. This configuration is similar to some approaches to deep learning on images in use today and resembles a type of neural network known as an extreme learning machine.
图 2-1:Mark I 感知器的结构
Figure 2-1: The organization of the Mark I Perceptron
如果感知器走在正确的轨道上,为什么几十年来它几乎被遗忘了?原因之一是罗森布拉特热衷于炒作。 1958 年,在美国海军(感知机项目的赞助商)组织的一次会议上,罗森布拉特的评论过于夸张,以至于《纽约时报》报道称:
If the Perceptron was on the right track, why was it all but forgotten for decades? One reason was Rosenblatt’s penchant for hype. At a 1958 conference organized by the US Navy (a sponsor of the Perceptron project), Rosenblatt’s comments were so hyperbolic that the New York Times reported:
海军今天展示了电子计算机的雏形,它预计将能够行走、说话、看、写、自我复制并意识到自己的存在。据预测,后来的感知器将能够识别人们并喊出他们的名字,并立即将一种语言的语音翻译成另一种语言的语音和书写。
The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted.
这些评论在当时激怒了许多人,尽管现代人工智能系统确实允许机器行走、说话、看、写、识别人,并在不同语言之间翻译语音和写作,也许我们应该对罗森布拉特更加宽容。他只比他早了大约60岁。
The comments ruffled many feathers at the time, though as modern AI systems do allow machines to walk, talk, see, write, recognize people, and translate speech and writing between languages, perhaps we should be more forgiving toward Rosenblatt. He was only some 60 years early.
几年后,即 1963 年,Leonard Uhr 和 Charles Vossler 描述了一个程序,该程序可以像感知器一样解释由 0 和 1 组成的矩阵表示的 20×20 像素图像。与感知器不同,该程序能够生成学习其输入所需的图像特征的模式和组合。 Uhr 和 Vossler 的程序与 30 多年后出现的卷积神经网络类似,也是第 5 章的主题。
A few years later, in 1963, Leonard Uhr and Charles Vossler described a program that, like the Perceptron, interpreted a 20×20-pixel image represented as a matrix of 0s and 1s. Unlike the Perceptron, this program was able to generate the patterns and combinations of image features necessary to learn its inputs. Uhr and Vossler’s program was similar to the convolutional neural networks that appeared over 30 years later and are the subject of Chapter 5.
我所说的第一个“经典”机器学习模型出现于 1967 年,由 Thomas Cover 和 Peter Hart 提出,称为最近邻(nearest neighbors)。它是所有机器学习模型中最简单的,简单到几乎令人尴尬。要标记一个未知输入,它只需找到与之最相似的已知输入,然后把该输入的标签作为输出。当使用多个邻近的已知输入时,该方法称为 k-最近邻(k-nearest neighbors),其中 k 是一个小数字,例如 3 或 5。后来,Hart 与 Richard Duda 和 David Stork 于 1973 年合著了第一版《模式分类》(Pattern Classification);这部开创性著作把机器学习介绍给了许多计算机科学家和软件工程师,也包括我。
The first of what I call the “classical” machine learning models appeared in 1967, courtesy of Thomas Cover and Peter Hart. Known as nearest neighbors, it is the simplest of all machine learning models, almost embarrassingly so. To label an unknown input, it simply finds the known input most like it and uses that input’s label as the output. When using more than one nearby known input, the method is called k-nearest neighbors, where k is a small number, like 3 or 5. Hart went on to write the first edition of Pattern Classification, along with Richard Duda and David Stork, in 1973; this seminal work introduced many computer scientists and software engineers to machine learning, including me.
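最近邻方法简单到可以用几行代码写出一个示意版本。这里假设使用欧氏距离和多数表决——这是该方法的常见形式,并非 Cover 和 Hart 的原始代码:

Nearest neighbors is simple enough to sketch in a few lines. This assumes Euclidean distance and a majority vote, the method’s common form, not Cover and Hart’s original code:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Label x with the majority class among its k nearest training points."""
    # train is a list of (feature_vector, label) pairs
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))   # → A  (the nearest points are class A)
```

模型的“参数”就是训练数据本身:没有任何学习过程,只有查找。The model’s “parameters” are the training data itself: there is no learning step, only lookup.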
感知器的成功在 1969 年戛然而止,当时 Marvin Minsky 和 Seymour Papert 出版了他们的书《感知器》,该书证明单层和双层感知器网络无法对有趣的任务进行建模。我们将及时介绍“单层”和“双层”的含义。感知器,加上 1973 年詹姆斯·莱特希尔 (James Lighthill) 发布的《人工智能:综合调查》(俗称“莱特希尔报告”),迎来了现在所说的第一个人工智能冬天;人工智能研究的资金很快就枯竭了。
The success of the Perceptron came to a screeching halt in 1969, when Marvin Minsky and Seymour Papert published their book Perceptrons, which demonstrated that single- and two-layer perceptron networks weren’t able to model interesting tasks. We’ll cover what “single-layer” and “two-layer” mean in time. Perceptrons, coupled with the 1973 release of “Artificial Intelligence: A General Survey” by James Lighthill, universally known as “the Lighthill report,” ushered in what is now referred to as the first AI winter; funding for AI research dried up in short order.
明斯基和帕珀特对感知器模型的批评是合理的。然而,许多人忽略了他们的观察,即这种限制不适用于更复杂的感知器模型。不管怎样,损害已经造成,联结主义实际上消失了,直到 20 世纪 80 年代初。
Minsky and Papert’s criticisms of the perceptron model were legitimate; however, many people missed their observation that such limitations were not applicable to more complex perceptron models. Regardless, the damage was done, and connectionism virtually vanished until the early 1980s.
注意“实际上”这个词。1979 年,福岛邦彦(Kunihiko Fukushima)发表了一篇论文,1980 年被翻译成英文,题为“Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”。“Neocognitron”这个名字并没有流行起来,而它也许是“-tron”后缀最后的几次使用之一——这个后缀在此前三十年的计算机科学中曾非常流行。虽然 Uhr 和 Vossler 1963 年的程序与卷积神经网络有一些相似之处,但在许多人看来,Neocognitron 才是卷积神经网络的鼻祖。卷积神经网络的成功直接引发了当前的人工智能革命。
Note the “virtually.” In 1979, Kunihiko Fukushima released a paper that was translated into English in 1980 as “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position.” The name “Neocognitron” didn’t catch on, and this was perhaps one of the last uses of the “-tron” suffix that had been so popular in computer science for the previous three decades. While Uhr and Vossler’s 1963 program bore some similarities to a convolutional neural network, the Neocognitron is, to many people, the original. The success of convolutional neural networks led directly to the current AI revolution.
20 世纪 80 年代初,随着专门设计用于运行 Lisp 编程语言(当时是人工智能的通用语言)的计算机的出现,人工智能开始商业化。 (今天,它是 Python。)随着 Lisp 机器的出现,专家系统也随之兴起——旨在捕获狭窄领域专家知识的软件。人工智能的商业化结束了第一个人工智能冬天。
In the early 1980s, AI went commercial with the advent of computers specifically designed to run the Lisp programming language, then the lingua franca of AI. (Today, it’s Python.) Along with Lisp machines came the rise of expert systems—software designed to capture the knowledge of an expert in a narrow domain. The commercialization of AI brought the first AI winter to an end.
The concept behind expert systems is, admittedly, seductive. To build an expert system that, for example, diagnoses a particular kind of cancer, you first interview experts to extract their knowledge and arrange it in a knowledge base. A knowledge base represents knowledge as a combination of rules and facts. Then, you combine the knowledge base with an inference engine, which uses the knowledge base to decide when and how to execute rules based on stored facts or input to the system by a user. Rules fire based on facts, which may lead to placing new facts in the knowledge base that cause additional rules to fire, and so on. A classic example of an expert system is CLIPS, which NASA developed in 1985 and released into the public domain in 1996.
In an expert system, there’s no connectionist network or collection of units from which intelligent behavior might (hopefully) emerge, making it a good example of symbolic AI. Instead, the knowledge base is an essentially rigid collection of rules, like “if the engine temperature is above this threshold, then this other thing is the likely cause,” and facts, like “the engine temperature is below the threshold.” Knowledge engineers are the link between the experts and the expert system. Building a knowledge base from the experts’ answers to the questions posed by the knowledge engineers is complex, and the resulting knowledge base is hard to modify over time. However, the difficulty of designing expert systems doesn’t mean they’re useless; they still exist, mainly under the guise of “business rule management systems,” but they currently have minimal impact on modern AI.
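The fire-and-add-facts cycle described above fits in a few lines of Python. Here is a toy forward-chaining inference engine, using made-up engine-temperature rules for illustration (this is a sketch of the idea, not how CLIPS or a real business rule management system is implemented):

```python
# Toy forward-chaining inference engine: rules fire on facts and may
# add new facts, which in turn let further rules fire.

def infer(facts, rules):
    """Apply rules repeatedly until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if set(conditions) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Each rule: IF all conditions are known facts THEN add the conclusion.
rules = [
    (["engine temperature high"], "cooling system suspect"),
    (["cooling system suspect", "coolant level low"], "coolant leak likely"),
]

result = infer(["engine temperature high", "coolant level low"], rules)
print("coolant leak likely" in result)  # True: the rules chained
```

Note the chaining: the first rule adds “cooling system suspect” to the knowledge base, which is exactly what the second rule needs to fire.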
The hype surrounding expert systems, combined with early successes, drove renewed interest in AI in the early 1980s. But when it became clear that expert systems were too brittle to have a general use, the bottom fell out of the industry, and AI’s second winter hit in the middle of the decade.
During the 1980s, connectionists occupied the background, but they were not sitting still. In 1982, John Hopfield demonstrated what are now known as Hopfield networks. A Hopfield network is a type of neural network that stores information in a distributed way within the weights of the network, and then extracts that information at a later time. Hopfield networks aren’t widely used in modern deep learning, but they provided an important demonstration of the utility of the connectionist approach.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams released their paper “Learning Representations by Back-propagating Errors,” which outlined the backpropagation algorithm for training neural networks. Training a neural network involves adjusting the weights between the neurons so that the network operates as desired. The backpropagation algorithm was the key to making this process efficient by calculating how adjusting a particular weight affects the network’s overall performance. With this information, it becomes possible to iteratively train the network by applying known training data, then using the network’s errors when classifying to adjust the weights to force the network to perform better on the next iteration. (I’ll discuss neural network training in more depth in Chapter 4.) With backpropagation, neural networks could go well beyond the limited performance of Rosenblatt’s Perceptron. However, even with backpropagation, neural networks in the 1980s were little more than toys. While there’s contention about who invented backpropagation and when, the 1986 paper is generally understood to be the presentation that influenced neural network researchers the most.
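The weight-adjustment loop just described can be reduced to its essence with one weight and one training example. This is a sketch of gradient descent only; full backpropagation applies the same idea through many layers of weights via the chain rule:

```python
# Gradient descent on a single weight: nudge w in the direction that
# reduces the squared error between prediction and target.
x, y = 2.0, 6.0   # one training example: we want w * x == y
w = 0.0           # initial weight
lr = 0.1          # learning rate

for _ in range(100):
    pred = w * x
    error = pred - y
    grad = 2 * error * x    # derivative of (w*x - y)**2 with respect to w
    w -= lr * grad          # the "adjust the weights" step of training

print(round(w, 3))  # 3.0, since 3.0 * 2.0 == 6.0
```

Each pass uses the network’s error to push the weight toward a value that performs better on the next iteration, exactly the loop the paragraph above describes.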
The second AI winter extended into the 1990s, but research continued in both the symbolic and connectionist camps. Corinna Cortes and Vladimir Vapnik introduced the machine learning community to support vector machines (SVMs) in 1995. In a sense, SVMs represent the high-water mark of classical machine learning. The success of SVMs in the 1990s through the early 2000s held neural networks at bay. Neural networks require large datasets and significant computational power; SVMs, on the other hand, are often less demanding of resources. Neural networks gain their power from the network’s ability to represent a function, a mapping from inputs to the desired outputs, while SVMs use clever mathematics to simplify difficult classification problems.
The success of SVMs was noted in the academic community as well as the broader world of software engineering, where applications involving machine learning were increasing. The general public was largely unaware of these advances, though intelligent machines continued appearing frequently in science fiction.
This AI winter ended in 1997 with the victory of IBM’s Deep Blue supercomputer against then world chess champion Garry Kasparov. At the time, few people thought a machine could ever beat the best human chess player. Interestingly, a decade earlier, one of my professors had predicted that an AI would accomplish this feat before the year 2000. Was this professor clairvoyant? Not really. Deep Blue combined fast custom hardware with sophisticated software and applied known AI search algorithms (in particular, the Minimax algorithm). Combined with heuristics and a healthy dose of custom knowledge from other chess grandmasters, Deep Blue was able to out-evaluate its human opponent by searching more possible moves than any human could ever hope to contemplate. Regardless, at its core, Deep Blue implemented what AI experts knew could beat a human if the machine had enough resources at its disposal. Deep Blue’s victory was inevitable because researchers expected computers to eventually become fast enough to overcome a human’s abilities. What was needed was known; all that remained was to put the pieces together.
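The Minimax algorithm at Deep Blue’s core is compact enough to sketch in full, here over a tiny hand-built game tree (a real chess program adds alpha-beta pruning, heuristic position evaluation, and enormous search depth):

```python
def minimax(node, maximizing):
    """Minimax over a game tree given as nested lists; leaves are scores."""
    if isinstance(node, (int, float)):   # leaf: a position's evaluation
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# A two-ply game tree: the maximizing player picks a branch, then the
# minimizing opponent picks the leaf that is worst for them.
tree = [[3, 5], [2, 9]]
print(minimax(tree, True))  # 3: the branch [3, 5] guarantees at least 3
```

The algorithm assumes the opponent always plays their best reply, which is why the tempting 9 in the second branch is ignored: the opponent would steer the game to the 2 instead.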
The year 1998 saw the publication of “Gradient-Based Learning Applied to Document Recognition,” a paper by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner that escaped public notice but was a watershed moment for AI and the world. While Fukushima’s Neocognitron bore strong similarities to the convolutional neural networks that initiated the modern AI revolution, this paper introduced them directly, as well as the (in)famous MNIST dataset we used in Chapter 1. The advent of convolutional neural networks (CNNs) in 1998 raises the question: why did it take another 14 years before the world took notice? We’ll return to this question later in the chapter.
Leo Breiman introduced random forests in 2001 by forming the existing pieces of what would become the random forest algorithm into a coherent whole, much like Darwin did with evolution in the 19th century. Random forests are the last of the classical machine learning algorithms we’ll contemplate in Chapter 3. If “random forests” remind you of the decision trees in Chapter 1, there’s a reason: a random forest is a forest of decision trees.
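The “forest of decision trees” idea comes down to majority voting. Here is a toy sketch in which one-question “stumps” stand in for fully grown trees; the feature indices and thresholds are made up for illustration, and a real random forest also trains each tree on a random subset of the data:

```python
# A "stump" is a one-question decision tree: is feature f greater than t?
def make_stump(f, t):
    return lambda x: 1 if x[f] > t else 0

def forest_predict(forest, x):
    """Majority vote across the trees, as in a random forest."""
    votes = sum(tree(x) for tree in forest)
    return 1 if votes > len(forest) / 2 else 0

# Three stumps, each consulting a different feature of the input.
forest = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(2, 0.5)]

print(forest_predict(forest, [0.9, 0.8, 0.1]))  # 1: two of three trees vote 1
print(forest_predict(forest, [0.1, 0.8, 0.2]))  # 0: only one tree votes 1
```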
Stacked denoising autoencoders are one type of intermediate model, and they were my introduction to deep learning in 2010. An autoencoder is a neural network that passes its input through a middle layer before generating output. It aims to reproduce its input from the encoded form of the input in the middle layer.
An autoencoder may seem like a silly thing to fiddle with, but while learning to reproduce its input, the middle layer typically learns something interesting about the inputs that captures their essence without focusing on fine, trivial details. For example, if the inputs are the MNIST digits, then the middle layer of an autoencoder learns about digits as opposed to letters.
A denoising autoencoder is similar, but we discard a random fraction of the input values before pushing the input through the middle layer. The autoencoder must still learn to reproduce the entire input, but now it has a more challenging task because the input is incomplete. This process helps the autoencoder’s middle layer discover a better encoding of the input.
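The shapes involved are easy to see with NumPy and random, untrained weights. The layer sizes here are arbitrary choices for illustration; training would adjust W_enc and W_dec until the output reproduces the uncorrupted input:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, fraction=0.3):
    """Zero out roughly `fraction` of the input values (the 'denoising' twist)."""
    mask = rng.random(x.shape) >= fraction
    return x * mask

# An untrained autoencoder: input (8 values) -> middle layer (3) -> output (8).
W_enc = rng.normal(size=(8, 3))
W_dec = rng.normal(size=(3, 8))

x = rng.random(8)
middle = np.tanh(corrupt(x) @ W_enc)  # the compressed encoding
output = middle @ W_dec               # the attempted reconstruction

print(middle.shape, output.shape)  # (3,) (8,)
```

Squeezing eight values through three forces the middle layer to keep only what matters about the input, and corrupting the input makes that compression more robust.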
Finally, a stacked denoising autoencoder is a stack of denoising autoencoders, wherein the output of the middle layer of one becomes the input of the next. When arranged this way, the stack learns a new representation of the input, which often helps a classifier appended to the top of the stack to discriminate between classes. For example, in my work at the time, the inputs were small pieces of an image that may have contained a target of interest. Two or three layers of trained stacked denoising autoencoders were used to transform the inputs into a list of numbers that would hopefully represent the input’s essence while ignoring the image’s minutiae. The outputs were then used with a support vector machine to decide if the input was a target.
Deep learning caught the world’s attention in 2012 when AlexNet, a particular convolutional neural network architecture, won the ImageNet challenge with an overall error of just over 15 percent—far lower than any competitor. The ImageNet challenge asks models to identify the main subject of color images, whether a dog, a cat, a lawnmower, and so on. In reality, “dog” isn’t a sufficient answer. The ImageNet dataset contains 1,000 classes of objects, including some 120 different dog breeds. So, a correct answer would be “it’s a Border Collie” or “it’s a Belgian Malinois.”
Random guessing means randomly assigning a class label to each image. In that case, we would expect an overall success rate of 1 in 1,000, or an error rate of 99.9 percent. AlexNet’s error of 15 percent was truly impressive—and that was in 2012. By 2017, convolutional neural networks had reduced the error to about 3 percent, below the approximate 5 percent achievable by the few humans brave enough to do the challenge manually. Can you discriminate between 120 different dog breeds? I certainly can’t.
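The 99.9 percent baseline is easy to verify by simulation:

```python
import random

random.seed(42)
n_classes, n_images = 1000, 100_000   # ImageNet has 1,000 classes

true_labels = [random.randrange(n_classes) for _ in range(n_images)]
guesses = [random.randrange(n_classes) for _ in range(n_images)]

correct = sum(t == g for t, g in zip(true_labels, guesses))
error_rate = 1 - correct / n_images
print(round(error_rate, 3))  # very close to 0.999
```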
AlexNet opened the floodgates. The new models broke all previous records and began to accomplish what no one had really expected from them: tasks like reimagining images in the style of another image or painting, generating a text description of the contents of an image along with the activity shown, or playing video games as well as or better than a human, among others.
The field was moving so quickly that it became nearly impossible to keep up with each day’s deluge of new papers. The only way to stay current was to attend multiple conferences per year and review the new work appearing on websites such as arXiv (https://www.arxiv.org), where research in many fields is first published. This led to the creation of sites like https://www.arxiv-sanity-lite.com, which ranks machine learning papers according to reader interest in the hope that the “best” might become easier to find.
In 2014, another breakthrough appeared on the scene, courtesy of researcher Ian Goodfellow’s insight during an evening’s conversation with friends. The result was the birth of generative adversarial networks (GANs), which Yann LeCun called at the time the most significant breakthrough in neural networks in 20 to 30 years (overheard at NeurIPS 2016). GANs, which we’ll discuss in Chapter 6, opened a new area of research that lets models “create” output that’s related to but different from the data on which they were trained. GANs led to the current explosion of generative AI, including systems like ChatGPT and Stable Diffusion.
Reinforcement learning is one of the three main branches of machine learning, the other two being the supervised learning we’ve been discussing and unsupervised learning, which attempts to train models without labeled datasets. In reinforcement learning, an agent (a model) is taught via a reward function how to accomplish a task. The application to robotics is obvious.
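Here is a minimal illustration of reward-driven learning: tabular Q-learning in a five-state corridor where the only reward is reaching the goal. This is a toy sketch; deep reinforcement learning systems replace the table with a neural network, but the reward-driven update is the same in spirit:

```python
import random

random.seed(0)

# A tiny corridor: states 0..4, goal at state 4. Actions: 0 = left, 1 = right.
# The agent learns purely from the reward function: +1 for reaching the goal.
Q = [[0.0, 0.0] for _ in range(5)]
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for _ in range(2000):
    s = 0
    while s != 4:
        if random.random() < epsilon:                    # explore
            a = random.randrange(2)
        else:                                            # exploit what's learned
            a = max((0, 1), key=lambda act: Q[s][act])
        s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s2 == 4 else 0.0                      # the reward function
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "right" is the better action in every non-goal state.
print(all(Q[s][1] > Q[s][0] for s in range(4)))  # True
```

No one tells the agent that moving right is correct; the policy emerges entirely from the reward signal propagating backward through the Q-values.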
Google’s DeepMind group introduced a deep reinforcement learning–based system in 2013 that could successfully learn to play Atari 2600 video games as well as or better than human experts. (Who counts as an expert in a then-35-year-old game system, I’m not sure.) The most impressive part of the system, to me, was that the model’s input was precisely the human’s input: an image of the screen, nothing more. This meant the system had to learn how to parse the input image and, from that, how to respond by moving the joystick to win the game (virtually—they used emulators).
The gap between beating humans at primitive video games and beating humans at abstract strategy games like Go was, historically, deemed insurmountable. I was explicitly taught in the late 1980s that the Minimax algorithm used by systems like Deep Blue to win at chess did not apply to a game like Go; therefore, no machine would ever beat the best human Go players. My professors were wrong, though they had every reason at the time to believe their statement.
In 2016, Google’s AlphaGo system beat Go champion Lee Sedol in a five-game match, winning four to one. The world took notice, further enhancing the growing realization that a paradigm shift had occurred. By this time, machine learning was already a commercial success. However, AlphaGo’s victory was utterly impressive for machine learning researchers and practitioners.
Most of the general public didn’t notice that AlphaGo, trained on thousands of human-played Go games, was replaced in 2017 by AlphaGo Zero, a system trained entirely from scratch by playing against itself, with no human input given. In short order, AlphaGo Zero mastered Go, even beating the original AlphaGo system (scoring a perfect 100 wins and no losses).
However, in 2022, the current state-of-the-art Go system, KataGo, was repeatedly and easily defeated by a system trained not to win but to reveal the brittleness inherent in modern AI systems. The moves the adversarial system used were outside the range encountered by KataGo when it was trained. This is a real-world example of how models are good at interpolating but bad at extrapolating. When the adversarial system was trained not to be better at Go but to exploit and “frustrate” the AI, it was able to win more than three out of four games. I point the reader to the Star Trek: The Next Generation episode “Peak Performance,” where Data the android “wins” a difficult strategy game against a master not by attempting to win but by attempting to match and frustrate.
Deep learning’s penchant for beating humans at video games continues. In place of primitive games like Atari’s, deep reinforcement learning systems are now achieving grandmaster-level performance at far more difficult games. In 2019, DeepMind’s AlphaStar system outperformed 99.8 percent of human players in StarCraft II, a strategy game requiring the development of units and a plan of battle.
The 1975 Asilomar Conference on Recombinant DNA was an important milestone in recognizing biotechnology’s growth and potential ethical issues. The conference positively impacted future research, and that year its organizers published a summary paper outlining an ethical approach to biotechnology. The potential hazards of a field then in its infancy were recognized early, and action was taken to ensure ethical issues were paramount when contemplating future research.
The 2017 Asilomar Conference on Beneficial AI intentionally mirrored the earlier conference to raise awareness of the potential hazards associated with AI. It is now common to encounter conference sessions with titles like “AI for Good.” The 2017 Asilomar conference resulted in the development of a set of principles to guide the growth and application of artificial intelligence. Similarly, as of 2023, the US government—specifically, the White House Office of Science and Technology Policy—has developed a “Blueprint for an AI Bill of Rights” meant to protect the American public from the harmful effects of AI indiscriminately applied. Indeed, White House officials have taken pains to address the AI community directly to encourage proper consideration in developing even more powerful AI systems. All of this is a good sign, but history teaches that human law often lags behind technological development, so the ultimate effectiveness of these necessary attempts at framing the field remains to be seen.
The COVID-19 pandemic of 2020 brought most of the world to a standstill. However, the AI community was only minimally impacted by the pandemic, likely because remote collaboration and conferences work rather well in this field. In addition, we can access powerful computers over the internet, so physical proximity doesn’t limit research as it might in other disciplines where access to the hardware is necessary.
Beginning in 2021 and continuing as I write, an explosion of new models has appeared, each more impressive than the last. Most can accept text input written by humans to produce text, images, or even video output. Often, the input marries text and example imagery to guide the system. Current popular (and concerning) systems include DALL-E 2, Stable Diffusion, Midjourney, and ChatGPT, among others. Most large tech companies have also demonstrated cutting-edge techniques that generate video, audio, and even 3D models. The current buzz generated by these systems is reminiscent of the excitement in 2012, when deep learning proved its worth with AlexNet. It’s another paradigm shift; but, to be clear, none of these systems are conscious. I’m not alone in this opinion. Even ChatGPT agrees with me. I asked it: “Are you a conscious being?” Here’s its reply, verbatim:
As a machine learning model, I am not a conscious being. I am a computer program that has been trained to generate text based on input from a user. I do not have the ability to think, reason, or experience consciousness in the same way that humans do. I am designed to provide information and assist users, but I do not have any independent thought or consciousness.
I think Ada Lovelace would be satisfied with this answer, though she might be puzzled about how an unthinking machine could generate it. Note, however, that ChatGPT didn’t claim not to think, only that it doesn’t think like humans do. We’ll explore image synthesis in Chapter 6 and large language models like ChatGPT in Chapter 7. Perhaps then we’ll find a resolution to her (assumed) confusion.
So, why now? The short answer is the fall of symbolic AI and the rise of technological innovations highly favorable to the connectionist approach.
Symbolic AI and connectionism emerged together, with symbolic AI dominating for decades and forcing connectionism into the background. But after two AI winters that left symbolic AI barely breathing, connectionism, assisted by key technological innovations, has risen to fill the void.
I think of the relationship between symbolic AI and connectionism as akin to that between non-avian dinosaurs and mammals. Dinosaurs and mammals emerged at roughly the same time, geologically speaking, but large, terrestrial dinosaurs dominated the world for about 160 million years, forcing mammals to eke out an existence in the shadows. When the asteroid hit 66 million years ago, the large dinosaurs were wiped out, allowing the mammals to evolve and take over.
Of course, analogies ultimately break down. The dinosaurs didn’t die out completely—we now call them birds—and they didn’t go extinct because they were somehow inferior. In fact, the dinosaurs are one of Earth’s greatest success stories. Non-avian dinosaurs died because of plain old bad luck. It was, almost literally, a disaster that did them in (“disaster” from the Italian disastro, meaning “ill star”).
Might symbolic AI reemerge? It’s likely in some form, but in cooperation with connectionism. Symbolic AI promised that intelligent behavior was possible in the abstract, and it didn’t deliver. Connectionism claims that intelligent behavior can emerge from a collection of simpler units. Deep learning’s successes support this view, to say nothing of the billions of living brains currently on the planet. But, as ChatGPT pointed out, existing connectionist models “do not think, reason, or experience consciousness in the same way that humans do.” Modern neural networks are not minds; they are representation-learning data processors. I’ll clarify what that means in Chapter 5.
Though our species, Homo sapiens, relies critically on symbolic thought, it isn’t a requirement for intelligence. In his book Understanding Human Evolution (Cambridge University Press, 2022), anthropologist Ian Tattersall claims it was unlikely that Neanderthals used symbolic thought as we do, nor did they have language as we do, but that they were nonetheless intelligent. Indeed, the Neanderthals were sufficiently human for our ancestors to “make love, not war” with them more than once—the DNA of people of non-African ancestry testifies to this fact.
I expect a synergy between connectionism and symbolic AI in the near future. For example, because a system like ChatGPT is, in the end, only predicting the next output token (word or part of a word), it can’t know when it’s saying something wrong. An associated symbolic system could detect faulty reasoning in the response and correct it. How such a system might be implemented, I don’t know.
Hints of what might emerge from connectionism were evident by the early 1960s. So, was it only symbolic AI bias that delayed the revolution for so many decades? No. Connectionism stalled because of speed, algorithm, and data issues. Let’s examine each in turn.
To understand why speed stalled the growth of connectionism, we need to understand how computers work. Taking great liberties, we can think of a computer as memory, which holds data (numbers), plus a processing unit, typically known as the central processing unit (CPU). A microprocessor—like the one in your desktop computer, smartphone, voice-controlled assistant, car, microwave, and virtually everything else you use that isn’t a toaster (oh, and in many toasters too)—is a CPU. Think of a CPU as a traditional computer: data comes into the CPU from memory or input devices like a keyboard or mouse, gets processed, then is sent out of the CPU to memory or an output device like a monitor or hard drive.
Graphics processing units (GPUs), on the other hand, were developed for displays, primarily for the video game industry, to enable fast graphics. GPUs can perform the same operation, such as “multiply by 2,” on hundreds or thousands of memory locations (read: pixels) simultaneously. If a CPU wants to multiply a thousand memory locations by 2, it must multiply the first, second, third, and so on sequentially. As it happens, the primary operation needed to train and implement a neural network is ideally suited to what a GPU can do. GPU makers, like NVIDIA, realized this early and began developing GPUs for deep learning. Think of a GPU as a supercomputer on a card that fits in your PC.
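The sequential-versus-simultaneous distinction can be mimicked in Python, with NumPy's vectorized arithmetic standing in for the GPU's apply-to-all-locations style (this still runs on a CPU, of course; it only illustrates the programming model):

```python
import numpy as np

# CPU style: touch each memory location in turn...
data = list(range(1000))
doubled_loop = []
for value in data:               # one multiplication at a time
    doubled_loop.append(value * 2)

# ...GPU style: apply "multiply by 2" to every location at once.
doubled_vectorized = np.array(data) * 2

print(doubled_loop[:3], doubled_vectorized[:3])  # [0, 2, 4] [0 2 4]
```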
In 1945, the Electronic Numerical Integrator and Computer (ENIAC) was state-of-the-art. ENIAC’s speed was estimated to be around 0.00289 million instructions per second (MIPS). In other words, ENIAC could perform just under 3,000 instructions in one second. In 1980, a stock 6502 8-bit microprocessor like the ones in most then-popular personal computers ran at about 0.43 MIPS, or some 500,000 instructions per second. In 2023, the already somewhat outdated Intel i7-4790 CPU in the computer I’m using to write this book runs at about 130,000 MIPS, making my PC some 300,000 times faster than the 6502 from 1980 and about 45 million times faster than ENIAC.
However, NVIDIA’s A100 GPU, when used for deep learning, is capable of 312 teraflops (TFLOPS), or 312,000,000 MIPS: 730 million times faster than the 6502 and an unbelievable 110 billion times faster than ENIAC. The increase in computational power over the timespan of machine learning boggles the mind. Moreover, training a large neural network on an enormous dataset often requires dozens to hundreds of such GPUs.
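The speedup factors quoted above follow directly from the MIPS figures:

```python
# The speed figures quoted in the text, in MIPS.
eniac = 0.00289
mos_6502 = 0.43
i7_4790 = 130_000
a100 = 312_000_000          # 312 TFLOPS treated as 312,000,000 MIPS

print(round(i7_4790 / mos_6502))        # 302326, roughly "300,000 times faster"
print(round(i7_4790 / eniac / 1e6, 1))  # 45.0 million times faster than ENIAC
print(round(a100 / mos_6502 / 1e6))     # 726, roughly "730 million times faster"
print(round(a100 / eniac / 1e9))        # 108, roughly "110 billion times faster"
```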
Conclusion: Until the advent of fast GPUs, computers were too slow to train neural networks with the capacity needed to build something like ChatGPT.
As you’ll learn in Chapter 4, we construct neural networks from basic units that perform a simple task: collect input values, multiply each by a weight value, sum, add a bias value, and pass the result to an activation function to create an output value. In other words, many input numbers become one output number. The collective behavior that emerges from thousands to millions of such units, with their billions of weight values, is what lets deep learning systems do what they do.
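The arithmetic a single unit performs is small enough to write out in a few lines. This is only an illustrative sketch (the function and variable names are mine, and the sigmoid activation is just one possible choice):

```python
import math

def unit(inputs, weights, bias):
    """One neural network unit: multiply each input by its weight,
    sum, add the bias, and pass the result through an activation
    function (here, a sigmoid)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # many inputs -> one output

# Three input numbers become a single output number.
out = unit([0.5, -1.2, 2.0], weights=[0.8, 0.1, -0.4], bias=0.3)
```

Stacking millions of these units, each with its own weights and bias, is what produces the billions of adjustable values mentioned above.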
The structure of a neural network is one thing; conditioning the neural network to the desired task is another. Think of the network’s structure, known as its architecture, as anatomy. In anatomy, we’re interested in what constitutes the body: this is the heart, that’s the liver, and so on. Training a network is more like physiology: how does one part work with another? The anatomy (architecture) was there, but the physiology (training process) was incompletely understood. That changed over the decades, courtesy of key algorithmic innovations: backpropagation, network initialization, activation functions, dropout and normalization, and advanced gradient descent algorithms. It’s not essential to understand the terms in detail, only to know that improvements in what these terms represent—along with the already mentioned improvements in processing speed, combined with improved datasets (discussion coming up)—were primary enablers of the deep learning revolution.
While it was long known that the right weight and bias values would adapt a network to the desired task, what was missing for decades was an efficient way to find those values. The introduction of the backpropagation algorithm in the 1980s, combined with stochastic gradient descent, began to change this.
Training iteratively locates the final set of weight and bias values according to the model’s errors on the training data. Iterative processes repeat from an initial state, some initial set of weights and biases. However, what should those initial weights and biases be? For a long time, it was assumed that the initial weights and biases didn’t matter much; just select small numbers at random over some range. This approach often worked, but many times it didn’t, causing the network not to learn well, if at all. A more principled approach to initializing networks was required.
Modern networks are still initialized randomly, but the random values depend on the network’s architecture and the type of activation function used. Paying attention to these details allowed networks to learn better. Initialization matters.
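As a rough illustration of what a principled scheme looks like, He initialization (often paired with ReLU activations) draws random weights whose spread shrinks as a layer’s input count grows. A hedged sketch, with illustrative names:

```python
import math
import random

def he_init(n_inputs, n_outputs):
    """He initialization: draw weights from a normal distribution
    whose standard deviation depends on the number of inputs
    feeding the layer."""
    std = math.sqrt(2.0 / n_inputs)
    return [[random.gauss(0.0, std) for _ in range(n_inputs)]
            for _ in range(n_outputs)]

random.seed(0)  # for reproducibility of this example only
layer = he_init(784, 128)  # a layer mapping 784 inputs to 128 outputs
```

The point is not the exact formula but that the random values are scaled to the layer, rather than chosen from an arbitrary fixed range.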
We arrange neural networks in layers, where the output of one layer becomes the input of the next. The activation function assigned to each node in the network determines the node’s output value. Historically, the activation function was either a sigmoid or a hyperbolic tangent, both of which produce an S-shaped curve when graphed. These functions are, in most cases, inappropriate, and were eventually replaced by a function with a long name that belies its simplicity: the rectified linear unit (ReLU). A ReLU asks a simple question: is the input less than zero? If so, the output is zero; otherwise, the output is the input value. Not only are ReLU activation functions better than the older functions, but computers can ask and answer that question virtually instantaneously. Switching to ReLUs was, therefore, a double win: improved network performance and speed.
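ReLU is simple enough to express in one line of code; here’s a minimal sketch:

```python
def relu(x):
    """Rectified linear unit: zero for negative inputs,
    the input itself otherwise."""
    return 0.0 if x < 0 else x

y = relu(-3.2)  # negative input, so the output is zero
```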
Dropout and batch normalization are advanced training approaches that are somewhat difficult to describe at the level we care to know about them. Introduced in 2012, dropout randomly sets parts of the output of a layer of nodes to zero when training. The effect is like training thousands of models simultaneously, each independent but also linked. Dropout, when appropriate, has a dramatic impact on network learning. As a prominent computer scientist told me at the time, “If we had had dropout in the 1980s, this would be a different world now.”
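A hedged sketch of the idea follows. This version implements so-called inverted dropout, a common formulation in which the surviving outputs are scaled up so the layer’s expected output is unchanged; the names are illustrative:

```python
import random

def dropout(outputs, p=0.5):
    """During training, zero each output with probability p and
    scale the survivors by 1/(1-p) so the expected value of the
    layer's output stays the same."""
    return [0.0 if random.random() < p else x / (1.0 - p)
            for x in outputs]

random.seed(42)  # for reproducibility of this example only
thinned = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
```

Each training pass zeroes a different random subset, which is why the effect resembles training many slightly different networks at once.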
Batch normalization adjusts the data moving between layers as it flows through the network. Inputs appear on one side of the network and flow through layers to get to the output. Schematically, this is usually presented as a left-to-right motion. Normalization is inserted between the layers to change the values to keep them within a meaningful range. Batch normalization was the first learnable normalization technique, meaning it learned what it should do as the network learned. An entire suite of normalization approaches evolved from batch normalization.
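In sketch form, normalizing one feature across a batch looks like the following. The learnable scale (gamma) and shift (beta) are shown as plain parameters; in a real network, training adjusts them along with the weights:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit
    variance, then apply the learnable scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta
            for x in batch]

normed = batch_norm([10.0, 20.0, 30.0, 40.0])
```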
The last critical algorithmic innovation enabling the deep learning revolution involves gradient descent, which works with backpropagation to facilitate learning the weights and biases. The idea behind gradient descent is far older than machine learning, but the versions developed in the last decade or so have contributed much to deep learning’s success. We’ll learn more about this subject in Chapter 4.
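The core idea of gradient descent fits in a few lines: repeatedly nudge a weight opposite the gradient of the loss. A toy sketch minimizing the loss (w - 3)^2, not any particular library's implementation:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Plain gradient descent: step the weight opposite the
    gradient of the loss, over and over."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# The loss (w - 3)^2 has gradient 2*(w - 3); descent converges to w = 3.
w = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The modern variants alluded to above refine how the step size adapts, but the stepping-downhill idea is the same.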
Conclusion: The first approaches to training neural networks were primitive and unable to take advantage of their true potential. Algorithmic innovations changed that.
Neural networks require lots of training data. When people ask me how much data is necessary to train a particular model for a specific task, my answer is always the same: all of it. Models learn from data; the more, the better because more data means an improved representation of what the model will encounter when used.
Before the World Wide Web, collecting, labeling, and processing datasets of the magnitude necessary to train a deep neural network proved difficult. This changed in the late 1990s and the early 2000s with the tremendous growth of the web and the explosion of data it represented.
For example, Statista (https://www.statista.com) claims that in 2022, 500 hours of new video were uploaded to YouTube every minute. It’s also estimated that approximately 16 million people were using the web in December 1995, representing 0.4 percent of the world’s population. By July 2022, that number had grown to nearly 5.5 billion, or 69 percent. Social media use, e-commerce, and simply moving from place to place while carrying a smartphone are enough to generate staggering amounts of data—all of which is captured and used for AI. Social media is free because we, and the data we generate, are the product.
A phrase I often hear in my work is “we used to be data-starved, but now we’re drowning in data.” Without large datasets and enough labels to go with them, deep learning cannot learn. But, on the other hand, with large datasets, awe-inspiring things can happen.
Conclusion: In machine learning, data is everything.
****
The main takeaways from this chapter are:
With this historical background complete enough for our purposes, let’s return to machine learning, starting with the classical algorithms.
Beginning piano students don’t start with Liszt’s “La Campanella,” but “Mary Had a Little Lamb” or “Twinkle, Twinkle, Little Star.” The simpler pieces contain the basics of playing the piano, and mastering the basics allows students to progress over time. This principle holds in most areas of study, including artificial intelligence.
To reach our ultimate goal of understanding modern AI, we must begin in the “simpler” world of classical machine learning. What holds for the classical models is generally true for more advanced neural networks. This chapter explores three classical models: nearest neighbors, random forests, and support vector machines. Understanding these will prepare us for the neural networks of Chapter 4.
****
Figure 3-1 shows the training samples for a made-up dataset with two features (x0 and x1) and three classes (circles, squares, and triangles). We saw a similar plot in Chapter 1; see Figure 1-2. As with the iris dataset, every shape in the figure represents a sample from the training set. Figure 3-1 is the tool we’ll use to understand the nearest neighbors classical model.
Figure 3-1: A made-up training set with three classes and two features
As mentioned in the previous chapter, nearest neighbor classifiers are the simplest of models—so simple that there’s no model to train; the training data is the model. To assign a class label to a new, unknown input, find the training sample closest to the unknown sample and return that sample’s label. That’s all there is to it. Despite their simplicity, nearest neighbor classifiers are quite effective if the training data represents what the model will encounter in the wild.
As a natural extension to the nearest neighbor model, locate the k training samples nearest the unknown sample. k is often a number like 3, 5, or 7, though it can be any number. This type of model uses a majority voting system, so the assigned class label is the one that’s most common among the k training samples. If there’s a tie, select the label randomly. For example, if the model is contemplating the 5-nearest neighbors to an unknown sample, and two are class 0 while another two are class 3 (the fifth belonging to some third class), then assign the label by choosing randomly between 0 and 3; on average, you’ll make the correct choice 50 percent of the time.
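A minimal k-nearest neighbors sketch, using Euclidean distance and majority voting. The function names and the tiny dataset are invented for illustration, and this version breaks ties by whichever label is counted first rather than randomly:

```python
import math
from collections import Counter

def knn_classify(train, unknown, k=3):
    """Label an unknown sample by majority vote among the k
    training samples closest to it (Euclidean distance)."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny two-feature training set: (point, class label)
train = [((0, 0), "circle"), ((0, 1), "circle"),
         ((5, 5), "square"), ((5, 6), "square"), ((6, 5), "square")]
label = knn_classify(train, (5.1, 5.2), k=3)
```

Note that there is no training step at all; the `train` list itself is the model.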
Let’s use the nearest neighbor concept to classify some unknown inputs. Figure 3-2 shows the training samples again, along with two unknown samples: the diamond and the pentagon. We want to assign these samples to one of the three classes: circle, square, or triangle. The nearest neighbor approach says to locate the training sample closest to each unknown sample. For the diamond, that’s the square to its upper left; for the pentagon, it appears to be the triangle to the upper right. Therefore, a nearest neighbor classifier assigns class square to the diamond and class triangle to the pentagon.
Figure 3-2: Classifying unknown samples
I suspect you’ve noticed the lines connecting the unknown samples in Figure 3-2 to the three nearest training samples. These are the samples to use if k is 3. In this case, the classifier would again assign class square to the diamond, because all three of the nearest training samples are squares. For the pentagon, two of the three nearest neighbors are triangles and one is a square, so it would also again assign class triangle to the pentagon.
This example uses two-dimensional feature vectors, x0 and x1, so we can visualize the process. We’re not restricted to models with only two features; we can have dozens or even hundreds. The idea of “nearest” (distance) still has mathematical meaning even when there are too many features to graph. Indeed, many mathematical concepts qualify as distance measures, and in practice, nearest neighbor classifiers may use any of these measures depending on the dataset.
For example, let’s return to Chapter 1’s MNIST digits dataset. The samples are small, grayscale images of the digits 0 through 9 that we unravel into vectors of 784 elements. Therefore, each digit sample in the training set is a single point in a 784-dimensional space, just as in the previous example each sample was a point in a 2-dimensional space.
The full MNIST dataset has 60,000 training examples, meaning the training space consists of 60,000 points scattered throughout the 784-dimensional space (not quite, but more on that soon). It also has 10,000 test samples that we can use to evaluate the nearest neighbor model. I trained 1-nearest neighbor models using all 60,000 training samples, then 6,000 samples, then 600, before ending with a mere 60. Sixty samples in the training set implies about six examples of each digit. I say “about” because I sampled the training set randomly, so there might be eight of one digit and only three of another. In every case, I tested the model using all 10,000 test samples, thereby mimicking using the model in the real world.
Table 3-1 shows the model’s performance as the number of training examples changed.
Table 3-1: Changing the Training Set Size
| Training set size | Accuracy (%) |
|---|---|
| 60,000 | 97 |
| 6,000 | 94 |
| 600 | 86 |
| 60 | 66 |
Recall that accuracy is the percentage of the test samples that the model classified correctly by assigning the correct digit label, 0 through 9. When using the entire training set the model is correct 97 times out of 100, on average. Even when the training set is made 10 times smaller, the accuracy is still 94 percent. With 600 training examples—about 60 per digit—the accuracy falls to 86 percent. It’s only when the training set shrinks to a mere six examples of each digit, on average, that the accuracy falls dramatically to 66 percent.
However, before we’re too harsh on our nearest neighbor model, remember that there are 10 digit classes, so random guessing will be correct, on average, about 1 time in 10, for an accuracy of about 10 percent. In this light, even the 60-sample model is six times better than guessing randomly. Let’s explore this phenomenon a bit to see if we can gain some insight into why the nearest neighbor model does well with so little training data.
Imagine you’re alone in a basketball arena, sitting in the middle of the court. A speck of dust is suspended in the air somewhere in the arena. For convenience, the speck stays fixed in its position. Now imagine 59 more specks of dust inhabiting the air. Those 60 specks of dust are the 60 digit samples in our training set, and the arena is the three-dimensional world in which the digit image vectors live.
Now imagine a new speck of dust has appeared right in front of your nose. It’s a new digit vector you want to classify. The nearest neighbor model calculates the distance between that speck of dust and the 60 specks whose digit labels you know. The closest speck of dust to the new one is below the rim of the basket you’re facing, at a distance of 47 feet (14 meters). It’s a three, so the model returns a label of 3. Is it reasonable to think that the closest speck represents the proper label for the unknown sample? After all, there are only 60 specks of dust in the whole arena.
We need to consider two competing effects to provide a reasonable answer to this question. First, we should answer “no” because it seems silly to believe that we can represent the giant volume of the arena with 60 specks of dust. There’s too little data in the training set to fill the arena’s space. This observation, known as the curse of dimensionality, refers to the fact that as the number of dimensions increases, the number of training samples needed to fill the space grows at a very rapid rate—exponentially, to be more precise. The curse of dimensionality is one of the banes of classical machine learning.
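To make “exponentially” concrete: if we split each feature axis into just 10 bins and want at least one sample in every cell, the number of samples required is 10 raised to the number of dimensions. A quick sketch (the function name is mine):

```python
def samples_to_cover(bins_per_dim, dims):
    """Grid points needed to place one sample in every cell when
    each feature axis is split into a fixed number of bins."""
    return bins_per_dim ** dims

two_d = samples_to_cover(10, 2)      # a 2-feature problem: 100 samples
digits = samples_to_cover(10, 784)   # MNIST's 784 dimensions: astronomical
```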
The curse of dimensionality says we should have no hope of properly classifying digits when we have only 60 training samples and 784 dimensions . . . yet our nearest neighbor classifier still works. Not very well, but better than random guessing. Why? The reason has to do with the digits dataset and how similar examples of the different classes are to each other. All examples of fives look like a 5; if they didn’t, we wouldn’t recognize them as fives. Therefore, while there are 784 dimensions to the space of digits, most digits in a class will land relatively close to that class’s other digits. In other words, the specks of dust representing fives are likely clustered or grouped near each other, probably in a thin, tube-like region that snakes its way through the arena. The other digits are likely grouped similarly. Because of this, the nearest sample has a better chance of being from the same digit class than we initially suspected when considering the curse of dimensionality. Based on this observation, we upgrade our “no” answer to a wishy-washy “probably.”
We talk about this effect mathematically by saying that the digit data lies on a manifold with an effective dimensionality that is well below the 784 dimensions of the vectors representing the digits. That data often lies on lower-dimensional manifolds is a boon if we can make use of that information. The nearest neighbor model uses the information because the training data is the model. Later in the book, when we discuss convolutional neural networks, we’ll understand that such models learn new ways to represent their inputs, which is akin to learning how to represent the lower-dimensional manifold on which the data lives.
Before we get too excited about how well our nearest neighbor classifier performs with the digits dataset, though, let’s bring ourselves back to reality by attempting to classify real images. The CIFAR-10 dataset consists of 50,000 small 32×32-pixel color images from 10 different classes, including a mix of vehicles, like airplanes, cars, and trucks, and animals, like dogs, cats, and birds. Unraveling each of these images creates a vector of 3,072 elements, so we’re asking our classifier to separate images in a 3,072-dimensional space. Table 3-2 shows how it fares.
Table 3-2: Classifying CIFAR-10 with Nearest Neighbor
| Training set size | Accuracy (%) |
|---|---|
| 50,000 | 35.4 |
| 5,000 | 27.1 |
| 500 | 23.3 |
| 50 | 17.5 |
As with MNIST, random guessing leads to an accuracy of 10 percent. While our classifier performs better than this with all variations of training set size, its best accuracy is little more than 35 percent—nowhere near the 97 percent achieved with MNIST. Sobering realizations like this led many in the machine learning community to lament that generic image classification might be beyond our grasp. Thankfully, it isn’t, but none of the classical machine learning models do it well.
If we think in terms of manifolds—the idea that data often lives in a lower-dimensional space than the dimensionality of the data itself—then these results aren’t surprising. CIFAR-10 contains real-world photographs, often referred to as natural images. Natural images are far more complex than simple images like MNIST digits, so we should expect them to exist in a higher-dimensional manifold and consequently be harder to learn to classify. As it happens, there are numerical approaches to estimating the true dimensionality of data. For MNIST, even though the images live in a 784-dimensional space, the data is closer to 11-dimensional. For CIFAR-10, the intrinsic dimensionality is closer to 21 dimensions, so we expect to need far more training data to perform on par with MNIST.
Nearest neighbor models aren’t used often these days. Two issues contribute to why. First, while training a nearest neighbor model is effectively instantaneous because there’s nothing to train, using a nearest neighbor model is slow because we have to calculate the distance between the unknown sample and each of the training set samples. This calculation time grows linearly with the number of samples in the training set. The more training data we have, the better we expect the model to perform, but the slower it runs. Double the size of the training set, and the search time for each classification doubles as well.
Decades of study of nearest neighbor classifiers have uncovered all manner of tricks to mitigate the time it takes to find the nearest neighbor, or nearest k neighbors, but the effect remains: increasing the number of training samples increases the time it takes to use the classifier.
The second issue is common to all classical machine learning models, as well as the traditional neural networks we’ll discuss in Chapter 4. These models are holistic, meaning they interpret their input vectors as a single entity without parts. This is not the right thing to do in many cases. For example, writing a four uses multiple strokes, and there are definite parts that distinguish the four from an eight. Classical machine learning models don’t explicitly learn about these parts or where they appear, or that they might appear in multiple locations. Modern convolutional neural networks, however, do learn these things.
In sum, nearest neighbor models are straightforward to understand and trivial to train, but slow to use and unable to explicitly understand structure in their inputs. Let’s change gears to contemplate the forest and the trees.
****
We briefly explored decision trees, comprising a series of yes/no questions asked about an unknown sample, in Chapter 1. You begin at the root node and traverse the tree by answering the node’s question. If the answer is “yes,” move down one level to the left. If the answer is “no,” move down to the right. Continue answering questions until you reach a leaf (a node with no question), and assign the unknown sample whatever label is in the leaf node.
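That traversal is easy to sketch in code. The tree below is hand-built for illustration (real trees are learned from data), and the dictionary-based representation is just one of many possibilities:

```python
def classify(node, sample):
    """Walk a decision tree: internal nodes hold a question (a
    feature index and a threshold); leaves hold a class label."""
    if "label" in node:                                 # reached a leaf
        return node["label"]
    if sample[node["feature"]] < node["threshold"]:     # answer "yes"
        return classify(node["left"], sample)
    return classify(node["right"], sample)              # answer "no"

# Root asks: is feature 0 < 2.5? Its left child then asks about feature 1.
tree = {"feature": 0, "threshold": 2.5,
        "left": {"feature": 1, "threshold": 1.0,
                 "left": {"label": "circle"},
                 "right": {"label": "square"}},
        "right": {"label": "triangle"}}
```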
Decision trees are deterministic; once constructed, they don’t change. Therefore, traditional decision tree algorithms return the same decision tree for the same training set. More often than not, the tree doesn’t work all that well. If that happens, is there anything we can do? Yes! We can grow a forest of trees.
But if decision trees are deterministic, won’t the forest be nothing more than the same tree, over and over, like a mass of clones? It will, if we don’t do anything clever along the way. Fortunately, humans are clever. Researchers realized around the year 2000 that introducing randomness produces a forest of unique trees, each with its own strengths and weaknesses, but collectively better than any single tree. A random forest is a collection of decision trees, each randomly different from the others. The forest’s prediction is a combination of its trees’ predictions. Random forests are a manifestation of the wisdom of crowds.
Using randomness to build a classifier seems counterintuitive at first. If on Tuesday we present the model with sample X and it tells us that sample X is a member of class Y, then we don’t want it to tell us that it’s a member of class Z if we happen to present the same sample on Saturday. Fortunately, the randomness of a random forest doesn’t work that way. Give a trained forest sample X as input, and it always gives us class Y as output, even if it’s February 29.
Three steps go into growing a random forest: bagging (also called bootstrapping), random feature selection, and ensembling. Bagging and random feature selection help combat overfitting, a concept mentioned in Chapter 1. Single decision trees are prone to overfitting.
All three steps work together to grow a forest of decision trees whose combined outputs produce a (hopefully) better-performing model. Explainability is the price paid for this gain in power. A single decision tree explains itself by the series of questions and answers that produce its output. With dozens or hundreds of decision trees combining their output, explainability goes out the window, but we can live with that in many cases.
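The combination (ensembling) step itself is simple: collect each tree’s vote and return the most common label. A sketch, with stand-in “trees” in place of real learned ones:

```python
from collections import Counter

def forest_predict(trees, sample):
    """A random forest's classification is a majority vote over
    its individual trees' predictions."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": any callables mapping a sample to a label.
trees = [lambda s: "cat", lambda s: "dog", lambda s: "cat"]
label = forest_predict(trees, sample=[1.0, 2.0])
```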
As I’ve already mentioned several times, the training set is key to conditioning the model. This remains true with random forests. We have as a starting point a training set. As we grow the forest, decision tree by decision tree, we use the existing training set to create tree-specific training sets unique to the current decision tree. This is where bagging comes in.
Bagging refers to constructing a new dataset from the current dataset by random sampling with replacement. The phrase “with replacement” means we might select a training sample more than once or not at all. This technique is used in statistics to understand a measurement’s bounds. We’ll use the following example dataset of test scores to figure out what that means:
95, 88, 76, 81, 92, 70, 86, 87, 72
One way to assess a class’s performance on the test is to calculate the average score by taking the sum of all the scores divided by the number of scores. The sum is 747, and there are 9 scores, giving us an average of 83.
总的来说,测试分数是来自神话父进程的样本,该父进程为所进行的特定测试生成测试分数。这不是考虑测试分数的常见方法,但它是一种考虑数据集代表什么的机器学习方法。另一组学生的测试分数代表来自该测试的家长流程的另一个样本。如果我们有许多班级的测试分数,我们就可以非常有把握地了解真实的平均测试分数,或者至少是我们期望找到该平均分数的范围。
Collectively, the test scores are a sample from a mythical parent process that generates test scores for the particular test taken. This isn’t a common way to think about test scores, but it’s a machine learning way to think about what a dataset represents. The test scores from another group of students represent another sample from the parent process for this test. If we have many classes’ worth of test scores, we can get an idea about the true average test score, or at least the range over which we expect to find that average score, with a high degree of confidence.
We could give the test to many different classes to get multiple average scores, one per class, but instead we’ll use bagging to create new datasets from the collection of test scores we do have and look at their averages. To do that, we pick values from the collection of test scores at random, not caring whether we’ve already picked a particular score; some scores may be chosen several times and others not at all. Here are six such bootstrapped datasets:
The respective averages of each are 83.4, 86.2, 80.0, 81.7, 84.6, and 83.4 percent. The lowest is 80.0 percent, and the highest is 86.2 percent. This gives us some reason to believe that a large number of samples will produce an average more or less in that range.
This is how a statistician might use bagging. For us, the critical part is the six new datasets bootstrapped from the original dataset. When growing a random forest, every time we need a new decision tree, we’ll first use bagging to produce a new dataset, then train the decision tree using that dataset, not the original. Notice that many of the six datasets have repeated values. For example, dataset 1 used both 81 and 87 twice, but never 72. This randomization of the given dataset helps create decision trees that behave differently from one another yet are aligned with what the original dataset represents.
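Generating a bootstrapped dataset is straightforward in code. The following is a minimal sketch (the seed and helper names are my own, not from the chapter) that draws six resampled datasets from the test scores and prints their averages:

```python
import random

# The nine test scores from the example
scores = [95, 88, 76, 81, 92, 70, 86, 87, 72]

def bootstrap(data, rng):
    # Draw len(data) samples with replacement: some values may
    # appear more than once, others may never be chosen
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
datasets = [bootstrap(scores, rng) for _ in range(6)]
averages = [round(sum(d) / len(d), 1) for d in datasets]

print(sum(scores) / len(scores))  # 83.0, the original average
print(averages)                   # six bootstrapped averages clustered near 83
```

Each run of `bootstrap` plays the role of one tree-specific training set in a random forest.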
The second trick a random forest uses is to train the decision tree on a randomly selected set of features. Let’s use the toy dataset in Table 3-3 to understand what that means. As always, each row is a feature vector, a sample for which we know the proper class label. The columns are the values of that feature for each sample.
Table 3-3: A Toy Dataset
| # | x0 | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|
| 1 | 0.52 | 0.95 | 0.81 | 0.78 | 0.97 | 0.36 |
| 2 | 0.89 | 0.37 | 0.66 | 0.55 | 0.75 | 0.45 |
| 3 | 0.49 | 0.98 | 0.49 | 0.39 | 0.42 | 0.24 |
| 4 | 0.43 | 0.51 | 0.90 | 0.78 | 0.19 | 0.22 |
| 5 | 0.51 | 0.16 | 0.11 | 0.48 | 0.34 | 0.54 |
| 6 | 0.48 | 0.99 | 0.62 | 0.58 | 0.72 | 0.42 |
| 7 | 0.80 | 0.84 | 0.72 | 0.26 | 0.93 | 0.23 |
| 8 | 0.50 | 0.70 | 0.13 | 0.35 | 0.96 | 0.82 |
| 9 | 0.70 | 0.54 | 0.62 | 0.72 | 0.14 | 0.53 |
What does this dataset represent? I have no idea; it’s made up. My cheeky answer is a good reminder that machine learning models don’t understand what their datasets represent. They process numbers without context. Is it a pixel value? The number of square feet in a house? The crime rate of a county per 100,000 people? It doesn’t matter to the machine learning model—it’s all just numbers.
This toy dataset consists of nine feature vectors, each with six features, x0 through x5. The forest’s decision trees use a randomly selected subset of the six features. For example, say we randomly keep features x0, x4, and x5. Table 3-4 shows the dataset now used to train the decision tree.
Table 3-4: A Random Collection of Features
| # | x0 | x4 | x5 |
|---|---|---|---|
| 1 | 0.52 | 0.97 | 0.36 |
| 2 | 0.89 | 0.75 | 0.45 |
| 3 | 0.49 | 0.42 | 0.24 |
| 4 | 0.43 | 0.19 | 0.22 |
| 5 | 0.51 | 0.34 | 0.54 |
| 6 | 0.48 | 0.72 | 0.42 |
| 7 | 0.80 | 0.93 | 0.23 |
| 8 | 0.50 | 0.96 | 0.82 |
| 9 | 0.70 | 0.14 | 0.53 |
Each decision tree in the forest has been trained on a bootstrapped version of the dataset using only a subset of the available features. We’ve used randomness twice to grow a forest of trees that are all subtly different from each other, in both what data they’re trained on and which features they pay attention to.
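The same kind of sampling, this time without replacement, picks the feature subset for each tree. A minimal sketch (the seed and the choice of three features are illustrative):

```python
import random

rng = random.Random(3)
n_features = 6

# Pick three of the six feature indices, without replacement,
# for one tree in the forest
subset = sorted(rng.sample(range(n_features), 3))

# Reduce a feature vector to the chosen features;
# this vector is row 1 of Table 3-3
vector = [0.52, 0.95, 0.81, 0.78, 0.97, 0.36]
reduced = [vector[i] for i in subset]

print(subset)   # which features this tree will see
print(reduced)  # the feature vector restricted to those features
```

Repeating this for every tree produces a forest where no two trees see quite the same view of the data.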
Now that we have a forest, how do we use it? Enter the last of the three pieces: ensembling. Musically, an ensemble is a collection of musicians playing diverse instruments. The random forest is also an ensemble, with each decision tree a different musician playing a different instrument.
A musical ensemble produces a single output, the music, by combining the notes played by each instrument. Likewise, a random forest produces a single output, a class label, by combining the labels produced by each decision tree, typically by voting like a k-nearest neighbors classifier. We assign the winning label to the input.
For example, suppose we want to use a random forest of 100 already-trained trees to classify sample X. We give sample X to each tree. Each tree knows which subset of sample X’s features to use to arrive at a leaf with a label. We now have 100 possible class labels, the output of the forest’s 100 decision trees. If 78 of the trees assign sample X to class Y, the random forest proclaims sample X to be an instance of class Y.
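Ensembling by majority vote takes only a few lines with a counter. A sketch of the 100-tree example (the vote counts are hypothetical):

```python
from collections import Counter

def forest_predict(tree_labels):
    # The forest's output is the label most of its trees voted for
    return Counter(tree_labels).most_common(1)[0][0]

# Hypothetical forest of 100 trained trees: 78 vote for class Y, 22 for Z
votes = ["Y"] * 78 + ["Z"] * 22
print(forest_predict(votes))  # Y
```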
The random assignment of features to trees, combined with bootstrapped datasets and ensemble voting, gives a random forest its power. Ensembling is an intuitively attractive idea that isn’t restricted to random forests. Nothing stops us from training multiple model types on the same dataset and then combining their predictions in some way to arrive at a joint conclusion about an input sample. Each of the models will have its own strengths and weaknesses. When combined, the strengths tend to enhance the output quality, making the whole greater than the sum of its parts.
We have one more classical machine learning model to investigate, the support vector machine (SVM). After that, we’ll pit the models against each other to gain intuition about how they behave and provide a baseline against which we can compare the performance of neural networks.
**Support Vector Machines**
To understand support vector machines is to understand four concepts: margins, support vectors, optimization, and kernels. The math is a bit hairy, even for math people, but we’ll set that aside and focus instead on gaining a conceptual understanding.
Support vector machines are best understood visually, so we’ll begin with the example toy dataset in Figure 3-3. This is a two-class dataset (circles and squares) with two-dimensional feature vectors, features x0 and x1.
Figure 3-3: A two-class toy dataset with two features, x0 and x1
A classifier for this dataset is straightforward to construct because a line easily separates the dataset by class, with all the squares above it and to the right and all the circles below and to the left. But where should it go? There are an infinite number of lines that we might use. For example, we might pass the line just below all the squares. That line separates the classes, but if we encounter a sample from class square that lands just below the line when we use the classifier, we’ll make a mistake and assign the sample to class circle because it’s below the line we declared separates the classes. Similarly, if we place the line just above all the circles, we might call a new sample that’s actually a circle a square because it landed slightly above that line.
Given what we know based on the training data, we should place the separating line as far from each group as possible. Here’s where the concept of a margin comes into play. SVMs seek to maximize the margin between the two groups, meaning finding the place with the widest separation between classes. When they have the maximum margin, they place the boundary, here a line, in the middle of the margin because that’s the most sensible thing to do based on the information contained in the training data.
Figure 3-4 shows the training data with three additional lines. The dashed lines define the margin, and the heavy continuous line marks the boundary placed by the SVM to maximize the distance between classes. This is the best position for the line to minimize labeling errors between the two classes. This, in a nutshell, is all an SVM does.
Figure 3-4: The maximal margin separating line (heavy) and maximum margins (dashed)
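Once the boundary line is fixed, classifying a new point reduces to checking which side of the line it falls on. A sketch with a made-up line w·x + b = 0 (the weights are illustrative, not taken from the figure):

```python
import math

# A hypothetical separating line: x0 + x1 - 1 = 0
w = (1.0, 1.0)
b = -1.0

def signed_distance(x):
    # Perpendicular distance from point x to the line,
    # signed by which side of the line x falls on
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])

def classify(x):
    # Positive side -> square, negative side -> circle
    return "square" if signed_distance(x) > 0 else "circle"

print(classify((1.0, 1.0)))  # square
print(classify((0.0, 0.0)))  # circle
```

The SVM’s job is to choose w and b so that the smallest such distance to any training point, the margin, is as large as possible.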
The three other parts of an SVM—support vectors, optimization, and kernels—are used to find the margins and the separating line. In Figure 3-4, notice that the dashed lines pass through some of the data points. These points are the support vectors that the algorithm finds to define the margin. Where do those support vectors come from? Recall that the figure’s points represent specific feature vectors in the training set. Support vectors are members of the training set found via an optimization algorithm. Optimization involves finding the best of something according to some criteria. The optimization algorithm used by an SVM locates the support vectors that define the maximum margin and, ultimately, the separating line. In Chapter 1, we used an optimization algorithm when we discussed fitting data to a curve, and we’ll use one again when training neural networks.
We’re almost there; we have only one SVM concept remaining: kernels. As opposed to the popcorn variety or the kernel at the heart of your computer’s operating system, mathematical kernels relate two things—here, two feature vectors. The example in Figure 3-4 uses a linear kernel, meaning it uses the training data feature vectors as they are. Support vector machines admit many kinds of kernels to relate two feature vectors, but the linear kernel is the most common. Another kind, called a Gaussian kernel (or, even more verbose and impressive, a radial basis function kernel), often helps in situations where the linear kernel fails because the feature vectors are in a different kind of relationship to each other.
The kernel transforms the feature vectors into a different representation, an idea central to what convolutional neural networks do. One of the issues that made classical machine learning stumble for so long is that the data supplied to the models was too complex in its raw form for the model to make meaningful distinctions between classes. This is related to the idea of manifolds and intrinsic dimensionality introduced in our discussion of nearest neighbors.
Classical machine learning practitioners spent considerable effort trying to minimize the number of features needed by a model, paring the features down to the minimal set necessary for the model to distinguish between classes. This approach was termed feature selection or dimensionality reduction, depending on the algorithm used. Similarly, especially with SVMs, they used kernels to map the given feature vectors to a new representation, making separating classes easier. These approaches were human-led endeavors; we selected the features or the kernels in the hopes that they’d make the problem more manageable. But, as we’ll learn, modern deep learning lets the data speak for itself when learning new representations of the information the data contains.
In practice, training a support vector machine means locating good values for the parameters related to the kernel used. If the kernel is linear, as in the previous example, there’s only one value to find, universally called C. It’s a number, like 1 or 10, affecting how well the support vector machine performs. If using the Gaussian kernel, we have C and another parameter, known by the Greek letter γ (gamma). The art of training an SVM involves finding the magic values that work best for the dataset at hand.
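Finding those magic values is usually a search over a grid of candidates, scoring each (C, gamma) pair on held-out data. A sketch of the search loop, with a stand-in scoring function since there is no SVM in this snippet:

```python
from itertools import product

# Hypothetical grids of candidate hyperparameter values
Cs = [0.1, 1, 10]
gammas = [0.001, 0.01, 0.1]

def validation_score(C, gamma):
    # Stand-in for "train an RBF SVM with (C, gamma) and score it
    # on a validation set"; this made-up score peaks at C=1, gamma=0.01
    return 1.0 / (1.0 + abs(C - 1) + abs(gamma - 0.01))

# Try every combination and keep the best-scoring pair
best = max(product(Cs, gammas), key=lambda p: validation_score(*p))
print(best)  # the (C, gamma) pair with the highest validation score
```

In a real workflow, `validation_score` would train and evaluate an actual SVM; the loop structure is the same.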
The magic values used by a model are its hyperparameters. Neural networks have many hyperparameters, even more than SVMs. However, my experience has taught me that it’s often easier to tune a neural network (especially a modern deep neural network) than a support vector machine. I freely confess my bias here; others might disagree.
Support vector machines are mathematically elegant, and practitioners use that elegance to tweak the hyperparameters and the kernel used, along with a suite of old-school data preparation approaches, to construct a well-performing model that works well on data in the wild. Every step of this process relies on the intuition and experience of the human building the model. If they’re knowledgeable and experienced, they’ll likely succeed if the dataset is amenable to such a model, but success isn’t assured. On the other hand, deep neural networks are big, kind of clunky, and live or die by the raw data they’re fed. That said, by coming to the problem with a minimal set of assumptions, neural networks can generalize over elements of the dataset that humans cannot fathom, which I think is often why modern neural networks can do what was previously believed to be next to impossible.
SVMs are binary classifiers: they distinguish between two classes, as in the dataset in Figure 3-3. But sometimes we need to distinguish between more than two classes. How can we do that with an SVM?
We have two options for generalizing SVMs to multiclass problems. Assume we have 10 classes in the dataset. The first generalization approach trains 10 SVMs, the first of which attempts to separate class 0 from the other nine classes. The second likewise attempts to separate class 1 from the remaining nine, and so on, giving us a collection of models, each trying to separate one class from all the others. To classify an unknown sample, we give the sample to each SVM and return the class label of the model that produced the largest decision function value—the metric, or measurement, the SVM uses to decide its confidence in its output. This option is known as one-versus-rest or one-versus-all. It trains as many SVMs as there are classes.
The other option is one-versus-one, which trains a separate SVM for each possible pair of classes. The unknown sample is given to each model, and the class label that shows up most often is assigned to it. One-versus-one isn’t practical if the number of classes becomes too large. For example, for the 10 classes in CIFAR-10, we’d need 45 different SVMs. And if we tried this approach with the 1,000 classes in the ImageNet dataset, we’d be waiting a long time for the 499,500 different SVMs to train.
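The bookkeeping behind those counts is simple combinatorics: one-versus-rest needs one model per class, while one-versus-one needs one model per unordered pair of classes:

```python
def n_one_vs_rest(n_classes):
    # One SVM separating each class from all the others
    return n_classes

def n_one_vs_one(n_classes):
    # One SVM for every unordered pair of classes: n(n-1)/2
    return n_classes * (n_classes - 1) // 2

print(n_one_vs_one(10))    # 45, the CIFAR-10 case
print(n_one_vs_one(1000))  # 499500, the ImageNet case
```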
Support vector machines were well suited to the computing power commonly available in the 1990s and early 2000s, which is why they held neural networks at bay for so long. However, with the advent of deep learning, there’s little reason to resort to an SVM (in my opinion).
****
Let’s test the three classical models explored in this chapter using an open source dataset consisting of dinosaur footprint outlines that comes from the 2022 paper “A Machine Learning Approach for the Discrimination of Theropod and Ornithischian Dinosaur Tracks” by Jens N. Lallensack, Anthony Romilio, and Peter L. Falkingham. The footprint images were released under the Creative Commons CC BY 4.0 license, which allows reuse with attribution.
Figure 3-5 contains samples from the dataset. Theropod footprints (think T. rex) are in the top row, and ornithischian footprints (think duckbilled dinos like hadrosaurs) are at the bottom. The images used by the models were inverted to be white on a black background, rescaled to 40×40 pixels, and unraveled to become 1,600-dimensional vectors. The dataset is small by modern standards, with 1,336 training samples and 335 test samples.
Figure 3-5: Theropod (top) and ornithischian (bottom) footprints
I trained the following models:
After training, I tested the models with the held-out test set. I also timed how long it took to train each model and to test each model after training. Using a model after training is inference, meaning I tracked the inference time on the test set.
NOTE
This isn’t a programming book, but if you’re familiar with programming, especially Python, feel free to contact me at rkneuselbooks@gmail.com and I’ll send you the dataset and code.
Table 3-5 shows the results. Evaluating how well a model works is, as you might expect, a critical component of the machine learning process.
Table 3-5: Classifying Dinosaur Footprints
| Model | ACC | MCC | Train | Test |
|---|---|---|---|---|
| RF300 | 83.3 | 0.65 | 1.5823 | 0.0399 |
| RBF SVM | 82.4 | 0.64 | 0.9296 | 0.2579 |
| 7-NN | 80.0 | 0.58 | 0.0004 | 0.0412 |
| 3-NN | 77.6 | 0.54 | 0.0005 | 0.0437 |
| 1-NN | 76.1 | 0.50 | 0.0004 | 0.0395 |
| Linear SVM | 70.7 | 0.41 | 2.8165 | 0.0007 |
The first column on the left identifies the model: from top to bottom, random forest, radial basis function support vector machine, nearest neighbors (with 7, 3, and 1 neighbor), and linear support vector machine.
The ACC and MCC columns are metrics calculated from the confusion matrix, the single most crucial part of the machine learning practitioner’s toolbox when evaluating a model (see Chapter 1). For binary classifiers like the ones we have here, the confusion matrix counts the number of times a theropod test sample was correctly identified, the same for ornithischian test samples, and the number of times one was confused for the other.
Visually, the confusion matrix for a binary model looks like this:
|  | Ornithischian | Theropod |
|---|---|---|
| Ornithischian | TN | FP |
| Theropod | FN | TP |
The rows are the actual class label from the held-out test set. The columns are the labels assigned by the models. The cells are the counts of the number of times each combination of actual label and model-assigned label happened. The letters are the standard way to refer to what the numbers in the cells mean: TN is true negative, TP is true positive, FP is false positive, and FN is false negative. For the dinosaur footprint models, theropod is class 1, the “positive” class, making ornithischian class 0, or the “negative” class.
The number of times the model called an ornithischian footprint “ornithischian” is the TN count. Similarly, the TP count represents the number of times the model was right about a theropod footprint. The goal is to get TN and TP as high as possible while making FP and FN, the mistakes, as low as possible.
In Table 3-5, ACC refers to the accuracy: how many times was the classifier’s assigned label correct? While accuracy is the most natural metric to consider, it isn’t always the best, especially if the number of examples per class isn’t nearly equal. The random forest performed the best in terms of accuracy, correctly labeling more than 83 out of every 100 test images. The linear SVM was the worst; it was right only about 71 times out of 100. Because we have two classes, however, random guessing would be correct about 50 percent of the time, so even the linear SVM learned something from the footprint images. In terms of the confusion matrix cells, we define accuracy by adding TP and TN and dividing that sum by the sum of all four cells.
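In code, with hypothetical cell counts that reproduce the random forest’s roughly 83.3 percent accuracy on the 335-sample test set (the individual counts are my invention; the paper’s actual confusion matrix may differ):

```python
def accuracy(tp, tn, fp, fn):
    # Correct predictions divided by all predictions
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion matrix counts for a 335-sample test set
tp, tn, fp, fn = 140, 139, 28, 28
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.833
```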
The MCC column, which stands for Matthews correlation coefficient, introduces a new metric. It’s a different combination of the four numbers in the confusion matrix. MCC is my favorite metric for classifiers, and it is increasingly understood to be the best single-number measure of how well a model performs. (These metrics apply to more advanced deep learning models as well.) Table 3-5 is sorted by MCC, which, for this example, also happens to sort by ACC. For a binary model, the lowest possible MCC is –1, and the highest is 1. Random guessing gives an MCC of 0. An MCC of 1 means the model makes no mistakes. An MCC of –1, which never actually happens in practice, means that the model is perfectly wrong: in our case, it would label all theropod tracks ornithischian and all ornithischian tracks theropod. If you have a perfectly wrong classifier, swap the output labels to make it perfectly right.
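MCC combines all four confusion matrix cells at once. A sketch of the formula, checked against the extreme cases just described (the counts are hypothetical):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient:
    # +1 perfect, 0 random guessing, -1 perfectly wrong
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(50, 50, 0, 0))  # 1.0: every label correct
print(mcc(0, 0, 50, 50))  # -1.0: every label swapped
```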
The Train and Test columns list times in seconds. The Train column tells us how long it took to train the model before using it. The nearest neighbor models take virtually no time, a mere fraction of a millisecond, because there’s nothing to train. Recall that a nearest neighbor model is the training set itself; there is no model conditioned to approximate the data in some way.
The slowest model was the linear SVM. Curiously, the more complex radial basis function model trained in roughly one-third the time (a difference that can be attributed to how such models are implemented in code). The next slowest model to train was the random forest. This makes sense because there were 300 decision trees in the forest, and each of them had to be trained independently.
The inference time, in the Test column, was roughly the same between the nearest neighbor and random forest models. The SVM models were respectively slow (RBF) and very fast (linear), again reflecting differences in the implementation. Notice that the nearest neighbor models take longer to use than to train. This is the reverse of the usual scenario, especially for neural networks, as we’ll see later in the book. Typically, training is slow but needs to be done only once, while inference is fast. For nearest neighbor models, the larger the training set, the slower the inference time—a significant strike against them.
There are two main things to take away from this exercise: a general understanding of the performance of the classical models, which we’ll use as a baseline against which to compare a neural network in Chapter 4, and that even classical models can do well on this particular dataset. Their performance was on par with that of human experts (meaning paleontologists), who also labeled the dinosaur footprint outlines. According to the original paper by Lallensack et al. from which the dinosaur dataset was taken, the human experts were correct only 57 percent of the time. They were also allowed to label tracks as “ambiguous,” a luxury the models don’t have; the models always make a class assignment, with no “I don’t know” option. We can coerce some model types into making such statements, but the classical models of this chapter are not well suited to that.
****
Are the classical models symbolic AI or connectionism? Are they AI at all? Do they learn, or are they merely mathematical tricks? My answers to these questions follow.
In Chapter 1, I characterized the relationship between AI, machine learning, and deep learning as a series of nested concepts, with deep learning a form of machine learning and machine learning a form of AI (see Figure 1-1). This is the proper way to describe the relationship for most people, and it fits with Chapter 2’s history. From this perspective, the classical models of this chapter are a form of AI.
But are the classical models symbolic AI or connectionist AI? I say neither. They are not symbolic AI because they don’t manipulate logical rules or statements, and they’re not connectionist because they don’t employ a network of simple units that learn their proper association as they work with the data. Instead, I consider these models to be a fancy form of curve fitting—the output of an algorithm employing an optimization process to produce a function that best characterizes the training data, and, hopefully, the data encountered by the model in the wild.
For a support vector machine, the function is the structure of the model in terms of the support vectors it locates during its optimization process. A decision tree’s function is generated by a specific algorithm designed to repeatedly split the training data into smaller and smaller groups until a leaf is created that (usually) contains only examples from a single class. Random forests are merely collections of such functions working in parallel.
Tree classifiers are almost a form of genetic programming. Genetic programming creates computer code by simulating evolution via natural selection, where improved fitness corresponds to “is a better solution to the problem.” Indeed, genetic programming is a kind of evolutionary algorithm, and evolutionary algorithms, along with swarm intelligence algorithms, implement robust, generic optimization. Some people consider evolutionary algorithms and swarm intelligence to be AI. I don’t, though I frequently use them in my work. Swarms don’t learn; they search a space representing possible solutions to a problem.
Nearest neighbor models are even simpler; there is no function to create. If we have all the possible data generated by some parent process—that is, the thing creating the feature vectors that we’re trying to model—then we don’t need a model. To assign a class label to a feature vector, we simply look it up in the feature vector “phone book” and return the label we find there. Since we have all possible feature vectors with labels, there’s nothing to approximate, and any feature vector encountered in the wild will necessarily be in the book.
Barring access to all possible feature vectors for the problem at hand, a nearest neighbor model uses the closest feature vector in the incomplete phone book represented by the training data.
As an example, suppose we live in a town of 3,000 people, and all of them are in the phone book. (Are there still such things as phone books? If not, pretend.)
If we want to find Nosmo King’s phone number, we look in the book under “King” and scan until we hit “Nosmo,” and we have it. Suppose, however, that we don’t have a complete listing of all 3,000 people, but 300 selected at random. We still want to know Nosmo King’s phone number (class label), but it’s not in the phone book. However, there is a Burg R. King. There’s a good chance Burg is related to Nosmo because of the shared last name, so we return Burg’s phone number as Nosmo’s. Clearly, the more complete the phone book, the better the chance we’ll find our desired name or someone in that person’s household. That’s essentially all that a nearest neighbor model does.
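In code, the phone-book lookup becomes "find the closest entry and return its label." Here's a minimal 1-NN sketch; the feature vectors and labels are made up for illustration:

```python
import math

# Hypothetical labeled "phone book": (feature vector, class label)
book = [
    ((1.0, 2.0), 0),
    ((0.8, 1.7), 0),
    ((4.0, 4.5), 1),
    ((5.2, 3.9), 1),
]

def nearest_neighbor(query):
    """Return the label of the closest entry by Euclidean distance."""
    _, label = min(book, key=lambda entry: math.dist(query, entry[0]))
    return label

print(nearest_neighbor((0.9, 1.9)))  # closest to a class 0 entry
```

The more entries in `book`, the better the chance the closest match really is a "relative" of the query, which is the whole point of the analogy.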
To recap, support vector machines, decision trees, and random forests use data to generate functions according to a carefully crafted algorithm designed by a human. That is neither symbolic AI nor connectionism to me, but curve fitting or, perhaps more accurately, optimization. Nearest neighbor models are even worse; in their case, there’s no function at all.
This doesn’t mean that AI is bogus, but it does mean that what practitioners have in mind when they talk about AI is likely different from what the general public considers “artificial intelligence.”
However, all is not lost. There is a machine learning model worthy of the connectionist label: the neural network. It’s at the heart of the AI revolution, and it’s capable of actually learning from data. So, let’s put classical models and symbolic AI aside and devote our attention to neural networks.
KEY TERMS
bagging, curse of dimensionality, evolutionary algorithm, false negative, false positive, genetic programming, hyperparameters, inference, manifold, metric, nearest neighbor, one-versus-one, one-versus-rest, random forest, support vector machine, swarm intelligence, true negative, true positive
Connectionism seeks to provide a substrate from which intelligence might emerge. Today, connectionism means neural networks, with neural being a nod to biological neurons. Despite the name, however, the relationship between the two is superficial. Biological neurons and artificial neurons may possess a similar configuration, but they operate in an entirely different manner.
Biological neurons accept input on their dendrites, and when a sufficient number of inputs are active they “fire” to produce a short-lived voltage spike on their axons. In other words, biological neurons are off until they’re on. Some 800 million years of animal evolution have made the process considerably more complex, but that’s the essence.
The artificial neurons of a neural network likewise possess inputs and outputs, but instead of firing, the neurons are mathematical functions with continuous behavior. Some models spike like biological neurons, but we ignore them in this book. The neural networks powering the AI revolution operate continuously.
Think of a biological neuron like a light switch. It’s off until there is a reason (sufficient input) to turn it on. The biological neuron doesn’t turn on and stay on but flashes on and off, like flicking the switch. An artificial neuron is akin to a light with a dimmer switch. Turn the switch a tiny amount to produce a small amount of light; turn the switch further, and the light’s brightness changes proportionally. This analogy isn’t accurate in all cases, but it conveys the essential notion that artificial neurons are not all or nothing. Instead, they produce output in proportion to their input according to some function. The fog will lift as we work through the chapter, so don’t worry if this makes little sense at present.
Figure 4-1 is the most critical figure in the book. It’s also one of the simplest, as is to be expected if the connectionist approach is on the right track. If we understand what Figure 4-1 represents and how it operates, we have the core understanding necessary to make sense of modern AI.
Figure 4-1: The humble (artificial) neuron
Figure 4-1 contains three squares, a circle, five arrows, and labels like “x0” and “Output.” Let’s examine each in turn, beginning with the squares on the left.
Standard practice presents neural networks with the inputs on the left and data flow to the right. In Figure 4-1, the three squares labeled x0, x1, and x2 are the inputs to the neuron. They are the three features of a feature vector, what we want the neuron to process to give us an output leading to a class label.
The circle is labeled h, a standard notation for the activation function. The activation function’s job is to accept input to the neuron and produce an output value, the arrow heading off to the right in Figure 4-1.
The three input squares are connected to the circle (the node) by arrows, one from each input square. The arrows’ labels—w0, w1, and w2—are the weights. Every input to the neuron has an associated weight. The lone b linked to the circle by an arrow is the bias. It’s a number, as are the weights, the input xs, and the output. For this neuron, three numbers come in, and one number goes out.
The neuron operates like this:
output = h(w0 × x0 + w1 × x1 + w2 × x2 + b)
That’s all a neuron does: it multiplies its inputs by the weights, sums the products, adds the bias value, and passes that total to the activation function to produce the output.
Virtually all the fantastic accomplishments of modern AI are due to this primitive construct. String enough of these together in the correct configuration, and you have a model that can learn to identify dog breeds, drive a car, or translate from French to English. Well, you do if you have the magic weight and bias values, which training gives us. These values are so important to neural networks that one company has adopted “Weights & Biases” as its name; see https://www.wandb.ai.
We have choices for the activation function, but in modern networks it’s most often the rectified linear unit (ReLU) mentioned in Chapter 2. The ReLU is a question: is the input (the sum of the inputs multiplied by the weights plus the bias) less than zero? If so, the output is zero; otherwise, it’s whatever the input is.
Can something as straightforward as a lone neuron be useful? It can. As an experiment, I trained the neuron in Figure 4-1 using three features from the iris flower dataset from Chapter 1 as input. Recall, this dataset contains measurements of the parts of three different species of iris. After training, I tested the neuron with an unused test set that had 30 feature vectors. The neuron correctly classified 28, for an accuracy of 93 percent.
I trained the neuron by searching for a set of three weights and a bias value producing an output that, when rounded to the nearest whole number, matched the class label for an iris flower—either 0, 1, or 2. This is not the standard way to train a neural network, but it works for something as modest as a single neuron. We’ll discuss standard network training later in the chapter.
A single neuron can learn, but complex inputs baffle it. Complex inputs imply we need a more complex model. Let’s give our single neuron some friends.
Convention arranges neurons in layers, with the outputs from the previous layer the inputs to the following layer. Consider Figure 4-2, which shows networks with two, three, and eight nodes in the layer after the input. Arranging the network in layers simplifies the implementation in code and facilitates the standard training procedure. That said, there is no requirement to use layers if an alternative way to train the model can be found.
Figure 4-2: Two-, three-, and eight-node networks
Let’s begin with the two-node network at the upper left. The three inputs (squares) are there, but this time there are two circles in the middle layer and a single circle on the right. The inputs are fully connected to the two nodes in the middle layer, meaning a line connects each input square to each middle layer node. The middle layer outputs are connected to a single node on the far right, from which the network’s output comes.
The middle layers of a neural network between the input on the left and the output on the right are known as hidden layers. For example, the networks of Figure 4-2 each have one hidden layer with 2, 3, and 8 nodes, respectively.
A network with this configuration is suitable for a binary classification task, class 0 versus class 1, where the output is a single number representing the model’s belief that the input is a member of class 1. Therefore, the rightmost node uses a different activation function known as a sigmoid (also called a logistic). The sigmoid produces an output between 0 and 1. This is also the range used to represent a probability, so many people refer to the output of a node with a sigmoid activation function as a probability. This is not generally accurate, but we can live with the sloppiness. The nodes of the hidden layer all use ReLU activation functions.
How many weights and biases must we learn to implement the two-node network in Figure 4-2? We need one weight for each line (except the output arrow) and one bias value for each node. Therefore, we need eight weights and three bias values. For the model at the lower left, we need 12 weights and 4 biases. Finally, for the 8-node model, we need to learn 32 weights and 9 bias values. As the number of nodes in a layer increases, the number of weights increases even faster. This fact alone restrained neural networks for years, as potentially useful models were too big for a single computer’s memory. Of course, model size is relative. OpenAI’s GPT-3 has over 175 billion weights, and while they aren’t talking about how large GPT-4 is, rumor puts it at 1.7 trillion weights.
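The counting rule is easy to turn into code. This sketch assumes fully connected layers, with the layer sizes listed input first:

```python
def count_parameters(layer_sizes):
    """Count weights and biases for a fully connected network.

    layer_sizes lists the number of nodes per layer, inputs first;
    for example, [3, 2, 1] for the two-node network of Figure 4-2.
    """
    # One weight per line between adjacent layers
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per node, excluding the input squares
    biases = sum(layer_sizes[1:])
    return weights, biases

for sizes in ([3, 2, 1], [3, 3, 1], [3, 8, 1]):
    print(sizes, count_parameters(sizes))
```

Running it reproduces the counts in the text: 8 weights and 3 biases, then 12 and 4, then 32 and 9.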
We need a two-class dataset to explore the models in Figure 4-2. The dataset we’ll use is a classic one that attempts to distinguish between two cultivars of grapes used to make wine in a particular region of Italy. Unfortunately, the wines represented by the dataset are, it seems, no longer known. (That’s how old the dataset is.) However, we know that models don’t care about the labels—they use numbers—so we’ll use 0 and 1 as the labels.
We need three features, x0, x1, and x2. The features we’ll use are alcohol content in percent, malic acid, and total phenols. The goal is to train the models in Figure 4-2 to see how well each performs when identifying an unknown wine given measurements of the three features.
I trained the two-neuron model using a training set of 104 samples and a test set of 26 samples. This means I used 104 triplets of measured alcohol content, malic acid level, and total phenols, knowing the proper output label, class 0 or class 1. The training set conditioned the two-neuron model to give values to all eight weights and three biases. I promise we will discuss how training works, but for now, assume it happens so we can explore how neural networks behave. The trained model achieved an accuracy on the test set of 81 percent, meaning it was right more than 8 times out of 10. That's not too bad for such a small model and training set.
Figure 4-3 presents the trained two-neuron model. I added the weights to the links and the biases to the nodes so you can see them. I think it’s worth looking at the numbers at least once, and it’s best to do that with a simple model.
Figure 4-3: The two-neuron model trained on the wine dataset
Let’s use the model with two test samples to understand the process. The two test samples consist of three numbers each, the values of the features, (x0, x1, x2):
Sample 1: (–0.7359, 0.9795, –0.1333)
Sample 2: (0.0967, –1.2138, –1.0500)
You may have a question at this point. I said the features were alcohol content in percent, malic acid level, and total phenols. While I have no idea what the units are for measuring malic acid or total phenols, a percent is a percent, so why is x0 for the first sample a small negative number? We can’t have a negative percentage of alcohol.
The answer has to do with preprocessing. Raw data, like the percent alcohol, is seldom used with machine learning models as is. Instead, each feature is adjusted by subtracting the average value of the feature over the training set and dividing that result by a measure of how scattered the data is around the average value (the standard deviation). The original alcohol content was 12.29 percent, a reasonable value for wine, but after scaling, it became –0.7359.
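Here's a sketch of that scaling step. The actual mean and standard deviation of the wine training set aren't given in the text, so the alcohol percentages below are invented; only the 12.29 value comes from the example:

```python
import statistics

# Invented alcohol percentages standing in for a training set;
# 12.29 is the value from the text
train = [12.29, 13.10, 12.85, 13.40, 12.60]

mean = statistics.mean(train)
std = statistics.pstdev(train)  # spread of the data around the mean

# Standardize: subtract the mean, divide by the standard deviation
scaled = [(x - mean) / std for x in train]
print(round(scaled[0], 4))  # the standardized version of 12.29
```

After this transformation the feature has mean 0 and standard deviation 1, which is why a perfectly sensible alcohol percentage can come out as a small negative number.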
Let’s classify sample 1 using the learned weights and biases in Figure 4-3. The input to the top neuron is each feature multiplied by the weight on the line connecting that feature to the neuron, then summed with the bias value. The first feature gives us 0.4716 × –0.7359, the second 0.0399 × 0.9795, and the third –0.3902 × –0.1333, with bias value 0.0532. Adding all of these together gives –0.2028. This is the number passed to the activation function, a ReLU. Since it is negative, the ReLU returns 0, meaning the output from the top node is 0. Repeating the calculation for the bottom node gives 0.1720 as the input to the ReLU. That’s a positive number, so the ReLU returns 0.1720 as the output.
The outputs of the two nodes in the middle layer are now used as the inputs to the final node on the right. As before, we multiply the outputs by the weights, add them along with the bias value, and pass that to the activation function. In this case, the activation function is not a ReLU but a sigmoid.
The top node’s output is 0, and the bottom’s output is 0.1720. Multiplying these by their respective weights, summing, and adding the bias value of 2.2277 gives us 1.9502 as the argument to the sigmoid activation function, producing 0.8755 as the network’s output for the first input sample.
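Those numbers are easy to verify in code. Only the top hidden node's weights and the sigmoid's final argument are quoted in the text, so this sketch checks just those two steps:

```python
import math

# Inputs for sample 1 and the top hidden node's weights and bias,
# as quoted from Figure 4-3
x = (-0.7359, 0.9795, -0.1333)
w = (0.4716, 0.0399, -0.3902)
b = 0.0532

pre = sum(xi * wi for xi, wi in zip(x, w)) + b
out = max(0.0, pre)  # ReLU: the sum is negative, so the node outputs 0
print(round(pre, 4), out)

# The output node's sigmoid argument is given as 1.9502
confidence = 1.0 / (1.0 + math.exp(-1.9502))
print(round(confidence, 4))
```

The pre-activation works out to –0.2028 and the sigmoid to 0.8755, matching the walkthrough above.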
How should we interpret this output? Here’s where we learn an important aspect of neural networks:
Neural networks don’t tell us the actual class label for the input, but only their confidence in one label relative to another.
Binary models output a confidence value that we’re interpreting as the probability of the input belonging to class 1. Probabilities are numbers between 0 (no chance) and 1 (absolutely assured). Humans are generally more comfortable with percentages, which we get by multiplying the probability by 100. Therefore, we can say that the network is a little more than 87 percent confident that this input represents an instance of class 1.
In practice, we use a threshold—a cutoff value—to decide which label to assign. The most common approach for binary models is a threshold of 50 percent. If the output exceeds 50 percent (probability 0.5), we assign the input to class 1. This output is above 50 percent, so we assign “class 1” as the label. This sample is from class 1, meaning the network’s assigned label is correct.
We can repeat these calculations for the second input sample, (0.0967, –1.2138, –1.0500). I’ll leave walking through it to you as an exercise, but the network’s output for sample 2 is 0.4883. In other words, the network’s confidence that this sample belongs to class 1 is 49 percent. The cutoff is 50 percent, so we reject the class 1 label and assign this input to class 0. The actual class is class 1, so, in this instance, the network is wrong—it assigned a class 1 sample to class 0. Oops.
Is this a useful model? The answer depends on the context. We’re classifying wine by cultivar. If the model’s output is wrong 20 percent of the time, which is one time in five, is that acceptable? I suspect not, but there might be other tasks where a model with this level of accuracy is acceptable.
Neural networks offer some control over how their outputs are interpreted. For example, we might not use 50 percent as the cutoff. If we make it lower, say, 40 percent, we’ll capture more class 1 samples, but at the expense of mistakenly identifying more actual class 0 samples as class 1. In other words, we get to trade off one kind of error for another.
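The decision rule itself is a one-liner. This sketch shows how lowering the cutoff flips the label for sample 2, using the confidence values from the examples above:

```python
def assign_label(confidence, threshold=0.5):
    """Binary decision rule: class 1 if confidence exceeds the cutoff."""
    return 1 if confidence > threshold else 0

print(assign_label(0.8755))        # sample 1 is assigned class 1
print(assign_label(0.4883))        # sample 2 falls just below the cutoff
print(assign_label(0.4883, 0.40))  # a lower cutoff flips the decision
```

With a 40 percent threshold, sample 2 would have been labeled correctly, at the cost of pulling more genuine class 0 inputs over the line as well.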
Let’s bring the other models in Figure 4-2 into the mix. I trained all three models using the same training and test sets used for Figure 4-3. I repeated the process 240 times for each of the three models. Here are the average accuracies:
2-node: 81.5 percent
3-node: 83.6 percent
8-node: 86.2 percent
The model’s performance improves as the number of nodes in the hidden layer increases. This makes intuitive sense, as a more complex model (more nodes) implies the ability to learn more complex associations hidden within the training set.
I suspect you now have a new question: why did I train each model 240 times and report the average accuracy over all 240 models? Here’s another critical thing to understand about neural networks:
Neural networks are randomly initialized, such that repeated training leads to differently performing models even when using the same training data.
The phrase “randomly initialized” demands clarification. Look again at Figure 4-3. The numbers representing the weights and biases came from an iterative process. This means that an initial set of weights and biases are updated repeatedly, each time moving the network toward a better and better approximation of whatever function it is that links the input feature vectors and the output labels. Approximating this function well is what we want the network to do.
Why not initialize all the weights to the same value? The answer is that doing so forces the weights to learn similar characteristics of the data, which is something we don’t want, and in the end the model will perform poorly. If we set all of the initial weights to zero, the model does not learn at all.
An initial set of values are necessary for the iterative process to work. How should we pick the initial values? That’s an important question, and the answer for our current level of understanding is “at random,” meaning we roll dice, in a sense, to get the initial value for each weight and bias. The iterative process then refines these values to arrive at the final set in Figure 4-3.
However, the iterative process doesn’t always end in the same place. Pick a different random set of initial weights and biases, and the network will converge to a different set of final values. For example, the network in Figure 4-3 achieved an accuracy of 81 percent, as mentioned previously. Here are 10 more accuracies for the same network trained and tested on the same data:
89, 85, 73, 81, 81, 81, 81, 85, 85, 85
The accuracies range from a high of 89 percent to a low of 73 percent. All that changed between each training session was the collection of initial weights and biases. This is an often overlooked issue with neural networks. Networks should be trained multiple times, if feasible, to gather data on their effectiveness or, as with the 73 percent version of the network, to understand that a bad set of initial values was used purely by chance. I should also mention that the wide variation in the accuracy of this network is related to its being relatively small and containing only a few weights and biases. Larger models tend to be more consistent when trained repeatedly.
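A quick calculation over those ten runs shows how much the initialization alone moves the results:

```python
import statistics

# Accuracies from retraining the same network on the same data,
# changing only the random initial weights and biases (from the text)
accuracies = [89, 85, 73, 81, 81, 81, 81, 85, 85, 85]

print(statistics.mean(accuracies))            # average performance
print(max(accuracies), min(accuracies))       # best and worst runs
print(round(statistics.stdev(accuracies), 2)) # spread from initialization alone
```

A 16-point spread between the best and worst runs is exactly why, when feasible, a network should be trained several times before judging it.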
We’ve already covered a lot of ground, so a recap is in order: an artificial neuron multiplies its inputs by weights, sums the products, adds a bias, and passes the total through an activation function; neurons are arranged in layers, with ReLU activations in the hidden layers and a sigmoid on the output node of a binary classifier; the sigmoid output is a confidence value compared against a threshold to assign a class label; and because networks are randomly initialized, repeated training on the same data produces models that perform differently.
Now that we know what a neural network is and how it’s used, we finally come to the crux of the matter: where do the magic weights and biases come from in the first place? In Chapter 2, I briefly mentioned that neural networks improved in the 1980s thanks to two essential algorithms: backpropagation and gradient descent. These are the algorithms at the heart of neural network training.
We discussed optimization, the process of finding the best of something according to some criteria, in Chapter 3 in reference to support vector machines. Training a neural network is also an optimization process, involving learning the weights and biases that best fit the training data. Care must be taken, however, to make it more likely that the learned weights and biases fit general trends in the training data rather than the details of the specific training data itself. What I mean by that will become apparent as we learn more about the training process.
The general training algorithm is:
1. Select a network architecture.
2. Initialize the weights and biases with random values.
3. Push training data through the network.
4. Calculate the average error between the network’s outputs and the known labels.
5. Use the error to update the weights and biases.
6. Repeat from step 3 until the network is good enough.
These six steps include many important terms. It’s worth our time to ensure that we have an idea of what each means. In this chapter, architecture refers to the number of layers, typically hidden layers, used by the network. We have our input feature vector, and we can imagine each hidden layer working collectively to accept an input vector and produce an output vector, which then becomes the input to the next layer, and so on. For binary classifiers, the network’s output is a single node producing a value from 0 to 1. We’ll learn later in the book that this idea can be extended to multiclass outputs.
The algorithm indicates that training is an iterative process that repeats many times. Iterative processes have a starting point. If you want to walk from point A to point B, place one foot in front of the other. That’s the iterative part. Point A is the starting point. For a neural network, the architecture implies a set of weights and biases. The initial values assigned to those weights and biases are akin to point A, with training akin to placing one foot in front of the other.
The algorithm uses the phrase “average error.” What error? Here’s where a new concept enters the picture. Intuitively, we can see that simply picking some initial values for the weights and biases is not likely to lead to a network able to classify the training data accurately. Remember, we know the inputs and the expected outputs for the training data.
Say we push training sample 1 through the network to give us an output value, perhaps 0.44. If we know that sample 1 belongs to class 1, the error made by the network is the difference between the expected output and the actual output. Here, that’s 1 – 0.44, or 0.56. A good model might instead have produced an output of 0.97 for this sample, giving an error of only 0.03. The smaller the error, the better the model is at classifying the sample. If we push all the training data through the network, or a representative subset of it, we can calculate the error for each training sample and find the average over the entire training set. That’s the measure used by the (to be described) backpropagation and gradient descent algorithms to update the weights and biases.
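Here's a sketch of that error calculation, using the absolute difference between expected and actual outputs as in the example above. (Real training typically uses other loss functions, such as cross-entropy, but the idea is the same.) The first sample matches the 0.44 example from the text; the rest are hypothetical:

```python
# Known labels and hypothetical network outputs for five training samples
expected = [1, 0, 1, 1, 0]
outputs = [0.44, 0.12, 0.97, 0.80, 0.30]

errors = [abs(e - o) for e, o in zip(expected, outputs)]
average_error = sum(errors) / len(errors)
print(round(errors[0], 2))        # the 0.56 error from the text
print(round(average_error, 3))    # the number training tries to shrink
```

That single average is what backpropagation and gradient descent use to decide how to nudge every weight and bias.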
Finally, the training algorithm says to push data through the network, get an error, update the weights and biases, and repeat until the network is “good enough.” In a way, good enough is when the error, also called the loss, is as close to zero as possible. If the network produces 0 as the output for all class 0 samples and 1 as the output for all class 1 samples, then it performs perfectly on the training data, and the error will be zero. That’s certainly good enough, but we must be careful. Sometimes when that happens the network is overfitting, meaning it’s learned all the details of the training data without actually learning the general trends of the data that will allow it to perform well when used with unknown inputs in the wild.
In practice, overfitting is addressed in several ways, the best of which is acquiring more training data. We use the training data as a stand-in for all the possible data that could be produced by whatever process we are trying to model. Therefore, more training data means a better representation of that data collection. It’s the interpolate versus extrapolate issue we discussed in Chapter 1.
However, getting more training data might not be possible. Alternatives include tweaking the training algorithm to introduce things that keep the network from focusing on irrelevant details of the training data while learning. One such technique you may hear mentioned is weight decay, which penalizes the network if it makes the weight values too large.
Another common approach is data augmentation. Out of training data? No worries, data augmentation will invent some by slightly modifying the data you already have. Data augmentation takes the existing training data and mutates it to produce new data that might plausibly have been created by the same process that made the actual training data. For example, if the training sample is a picture of a dog, it will still be a picture of a dog if you rotate it, shift it up a few pixels, flip it left to right, and so on. Each transformation produces a new training sample. It might seem like cheating, but in practice, data augmentation is a powerful regularizer that keeps the network from overfitting during training.
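作为示意,上面这类变换可以用几行 NumPy 粗略勾勒。这只是一个演示用的草图:`augment` 函数及其具体变换都是假设的,并非书中的代码。
As an illustration, transformations like those above can be sketched in a few lines of NumPy. This is a demonstration sketch only: the `augment` function and its particular transformations are assumptions, not code from the book.

```python
import numpy as np

def augment(image, rng):
    """Produce one augmented variant of a 2D image array."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)        # flip left to right
    shift = rng.integers(-2, 3)     # shift up or down a few pixels
    out = np.roll(out, shift, axis=0)
    k = rng.integers(0, 4)
    out = np.rot90(out, k)          # rotate by a multiple of 90 degrees
    return out

rng = np.random.default_rng(0)
dog = np.arange(16).reshape(4, 4)   # tiny stand-in for a picture of a dog
new_sample = augment(dog, rng)      # a "new" training sample, same content
```

每次以不同的随机数调用,都会得到一个新的训练样本,而图像的内容(这里是像素值的集合)保持不变。
Each call with different random numbers yields a new training sample while the content of the image (here, the multiset of pixel values) is unchanged.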
让我们暂时回到初始化,因为多年来它的重要性并未得到充分认识。
Let’s return for a moment to initialization, as its importance was not sufficiently appreciated for many years.
起初,权重初始化只不过是“选择一个小的随机数”,例如 0.001 或 –0.0056。这在很多时候都有效。然而,它并不能始终如一地发挥作用,而且当它发挥作用时,网络的行为也并不理想。
At first, weight initialization meant nothing more than “pick a small random number” like 0.001 or –0.0056. That worked much of the time. However, it didn’t work consistently, and when it did work, the network’s behavior wasn’t stellar.
深度学习出现后不久,研究人员重新审视了“小随机值”的思路,以寻找更有原则的初始化方法。这些努力的成果就是神经网络沿用至今的初始化方式。需要考虑三个因素:激活函数的形式、来自下层的连接数量(扇入),以及通向上层的输出数量(扇出)。人们设计出利用这三个因素为每一层选择初始权重的公式。偏差值通常初始化为零。不难证明,这样初始化的网络比用老式方法初始化的网络性能更好。
Shortly after the advent of deep learning, researchers revisited the “small random value” idea in search of a more principled approach to initialization. The fruit of those efforts is the way neural networks are initialized to this day. Three factors need to be considered: the form of the activation function, the number of connections coming from the layer below (fan-in), and the number of outputs to the layer above (fan-out). Formulas were devised to use all three factors to select the initial weights for each layer. Bias values are usually initialized to zero. It isn’t difficult to demonstrate that networks so initialized perform better than those initialized the old-fashioned way.
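举例来说,两个常见的初始化公式是 Glorot(Xavier)初始化和 He 初始化:前者同时使用扇入与扇出,后者针对修正线性单元、主要依赖扇入。下面是一个 NumPy 草图,仅为示意,并非书中代码:
For example, two common formulas are Glorot (Xavier) initialization, which uses both fan-in and fan-out, and He initialization, which targets rectified linear units and relies mainly on fan-in. Here is a NumPy sketch, illustrative only and not code from the book:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier: the limit depends on both fan-in and fan-out."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He: the standard deviation depends on fan-in; suited to ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(42)
W = he_normal(fan_in=1600, fan_out=800, rng=rng)  # one hidden layer's weights
b = np.zeros(800)                                 # biases start at zero
```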
我们还有两个训练算法步骤需要讨论:反向传播和梯度下降。反向传播通常首先出现,因为它的输出对于梯度下降是必要的。然而,我认为更直观的是理解梯度下降正在做什么,然后用反向传播提供的内容来填补它所需的缺失部分。尽管名字很陌生,但我相信您已经了解这两种算法的本质。
We have two steps of the training algorithm yet to discuss: backpropagation and gradient descent. Backpropagation is often presented first because its output is necessary for gradient descent. However, I think it’s more intuitive to understand what gradient descent is doing, then fill in the missing piece it needs with what backpropagation provides. Despite the unfamiliar names, I am certain you already understand the essence of both algorithms.
你正站在一片广阔开阔、丘陵起伏的草原上。你是怎么来到这里的?你绞尽脑汁,却想不出答案。终于,你发现北边远处的山谷里有一个小村庄。也许那里的人能给你一些答案。但到达那里的最佳路线是什么?
You’re standing in a vast, open grassland of rolling hills. How did you get here? You strain your brain, but no answer comes. Then, finally, you spy a small village to the north, in the valley far below. Perhaps the people there can give you some answers. But what’s the best way to get there?
总的来说,你想向北并向下走,但你也必须顺应地势。你总是想从较高处移动到较低处。你不能朝正北走,因为一座大山挡住了去路。你可以往东北方向走:那边地势较平坦,但地势下降缓慢,走那条路会让旅程变得很长。因此,你决定朝西北方向走,因为那个方向既能让你向北移动,下坡也比向东更陡。你朝西北迈出一步,然后停下来重新评估所在位置,决定下一步往哪个方向走。
You want to go north and down, in general, but you must also respect the contour of the land. You always want to move from a higher to a lower position. You can’t go due north because a large hill is in your way. You could head northeast; the terrain is flatter there, but going that way will make your journey a long one, as the land drops slowly. So, you decide to head northwest, as that moves you both north and down more steeply than to the east. You take a step to the northwest, then pause to reassess your position to decide which direction to move in next.
重复这个两阶段的过程:先审视当前位置,确定最能同时让你向北和向下移动的方向,再朝那个方向迈出一步。这是到达山谷村庄的最佳办法。你可能到不了:你可能被困在一个爬不出来的小峡谷里。但总体而言,只要持续朝着相对于当前位置既向北又向下的方向移动,你就会朝目标不断前进。
Repeating this two-stage process of examining your current position to determine the direction that best moves you both northward and downward, then taking a step in that direction, is your best bet for reaching the village in the valley. You may not make it; you might get stuck in a small canyon out of which you can’t climb. But overall, you’ll make progress toward your goal by consistently moving in a direction that is north and down relative to your current position.
遵循这个称为梯度下降的过程,我们可以调整神经网络的初始权重和偏差,从而为我们提供性能更好的模型。换句话说,梯度下降训练模型。
Following this process, known as gradient descent, lets us adjust a neural network’s initial weights and biases to give us ever better-performing models. In other words, gradient descent trains the model.
村庄周围草地的三维世界对应于网络的n维世界,其中n是我们试图学习其值的权重和偏差的总数。选择从当前位置开始的方向,然后朝该方向移动一段距离,这就是梯度下降步骤。重复的梯度下降步骤让你越来越接近村庄。
The three-dimensional world of the grassland surrounding the village corresponds to the n-dimensional world of the network, where n is the total number of weights and biases whose values we are trying to learn. Choosing a direction to head in from your current position and then moving some distance in that direction is a gradient descent step. Repeated gradient descent steps move you closer and closer to the village.
梯度下降寻找最小值所在的位置,也就是山谷中的村庄。但这是什么东西的最小值呢?对于神经网络来说,梯度下降旨在调整网络的权重和偏差,以最小化训练集上的误差。
Gradient descent seeks the minimum position, the village in the valley—but the minimum of what? For a neural network, gradient descent aims to adjust the weights and biases of the network to minimize the error over the training set.
广阔开阔、丘陵起伏的草原代表误差函数,即使用与当前位置相对应的权重和偏差值时,网络在训练数据上的平均误差。这意味着草原中的每个位置都隐含着一整套网络权重和偏差。村庄的位置对应于网络在训练集上所能达到的最小误差。我们希望,在训练集上误差很小的模型,在实际环境中用于未知输入时也很少出错。梯度下降就是在权重和偏差空间中移动以使误差最小化的算法。
The vast, open grassland of rolling hills represents the error function, the average error over the training data when using the weight and bias values corresponding to your current position. This means that each position in the grassland implies a complete set of network weights and biases. The position of the village corresponds to the smallest error the network can make on the training set. The hope is that a model that has a small error on its training set will make few errors on unknown inputs when used in the wild. Gradient descent is the algorithm that moves through the space of weights and biases to minimize the error.
梯度下降是一种优化算法,这再次告诉我们,训练神经网络是一个优化问题,即一个需要找到某组最优值的问题。虽然这是事实,但训练神经网络又与其他优化问题有微妙的不同。如前所述,我们并不一定希望训练数据上的误差尽可能小,而是希望得到对未知输入泛化得最好的模型。我们希望避免过拟合。我将在本章后面直观地演示这意味着什么。
Gradient descent is an optimization algorithm, again telling us that training a neural network is an optimization problem, a problem where we need to find the best set of something. While this is true, it is also true that training a neural network is subtly different from other optimization problems. As mentioned previously, we don’t necessarily want the smallest possible error on the training data, but rather the model that best generalizes to unknown inputs. We want to avoid overfitting. I’ll demonstrate visually what that means later in the chapter.
梯度下降在误差函数的地形中移动。在日常用语中,梯度指的是某种事物的变化,比如道路的陡峭程度,或从一种色调平滑过渡到另一种色调的颜色渐变。在数学上,梯度是曲线在某一点处斜率的多维类比。最陡的移动方向就是沿最大梯度向下的方向。曲线上某点处直线的斜率是对梯度的一种有用的直观表示,因此花点时间思考斜率是值得的。
Gradient descent moves through the landscape of the error function. In everyday use, a gradient is a change in something, like the steepness of a road or a color gradient varying smoothly from one shade to another. Mathematically, a gradient is the multidimensional analog of the slope of a curve at a point. The steepest direction to move is down the maximum gradient. The slope of a line at a point on a curve is a helpful representation of the gradient, so contemplating slopes is a worthy use of our time.
图 4-4显示了一条曲线,有四条线在不同点相交。这些线代表这些点的斜率。斜率表示函数值在该点附近变化的速度。线越陡,沿x轴移动时函数值的变化越快。
Figure 4-4 shows a curve with four lines touching it at different points. The lines represent the slope at those points. The slope indicates how quickly the value of the function changes in the vicinity of the point. The steeper the line, the faster the function’s value changes as you move along the x-axis.
图 4-4:标有各个点斜率的曲线
Figure 4-4: A curve with the slope at various points marked
线 B 标记曲线上的最低点。这是全局最小值,也是优化算法寻求找到的点。请注意,接触该点的线完全是水平的。从数学上来说,这意味着直线 B 的斜率为零。这在函数的最小值(和最大值)处是正确的。
Line B marks the lowest point on the curve. This is the global minimum and the point that an optimization algorithm seeks to find. Notice that the line touching this point is entirely horizontal. Mathematically, this means that the slope of line B is zero. This is true at the minima (and maxima) of functions.
线 B 所触及的点是全局最小值,但图中还有其他三个最小值。这些是局部最小值,触及这些点的直线的斜率同样为零。理想情况下,优化算法会避开这些点,转而找到全局最小值。
The point touched by line B is the global minimum, but there are three other minima in the plot. These are local minima, points where the slope of the line touching those points is also zero. Ideally, an optimization algorithm would avoid these points, favoring the global minimum.
A 线很陡,并且指向全局最小值。因此,如果我们位于 A 线与曲线相接的点,就可以沿所指方向迈步,快速接近全局最小值。而且,由于这里坡度很陡,我们可以迈出相当大的步子下到谷底。
Line A is steep and points toward the global minimum. Therefore, if we were at the point on the curve touched by line A, we could move quickly toward the global minimum by taking steps in the indicated direction. Moreover, as the slope is steep here, we can take reasonably large steps down to the valley.
C 线也很陡,但朝向局部极小值之一,即x轴上刚好超过 3 的极小值。只知道如何向下移动梯度的梯度下降算法将找到局部最小值并卡在那里。这同样适用于线 D,它朝向x轴上 4 到 5 之间的局部最小值。
Line C is also steep but heads toward one of the local minima, the one just beyond 3 on the x-axis. A gradient descent algorithm that only knows how to move down the gradient will locate that local minimum and become stuck there. The same applies to line D, which heads toward the local minimum between 4 and 5 on the x-axis.
从图 4-4中可以得到什么要点?首先,梯度下降从某个点出发,沿梯度(即斜率)向下移动。这里的曲线是一维的,因此该点就是 x 的某个特定值。梯度下降利用该点的斜率值来选择方向,并选择与斜率陡峭程度成比例的步长。陡峭的斜率意味着我们可以迈出更大的一步,得到更接近最小值的新 x 值;平缓的斜率则意味着较小的步长。
What are the takeaways from Figure 4-4? First, gradient descent moves down the gradient, or slope, from some point. Here the curve is one- dimensional, so the point is a specific value of x. Gradient descent uses the value of the slope at that point to pick a direction and a step size proportional to the steepness of the slope. A steep slope means we can take a larger step to end up at a new x value closer to a minimum. A shallow slope implies a smaller step.
例如,假设我们最初位于 A 线与曲线相接的点。斜率很陡,所以我们朝全局最小值迈出一大步。迈出这一步之后,我们再次查看斜率,但这次是 x 轴上新位置处的斜率。利用该斜率,我们再迈一步,然后再迈一步,直到到达斜率基本为零的点。那就是最小值,于是我们停下来。
For example, suppose we are initially at the point where line A touches the curve. The slope is steep, so we take a big step toward the global minimum. After the step, we look at the slope again, but this time it’s the slope at the new point on the x-axis. Using that slope, we take another step, then another, and another until we get to a point where the slope is essentially zero. That’s the minimum, so we stop.
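这种“看斜率、迈一步、再看斜率”的循环用几行 Python 就能写出来。下面的草图在一条假设的玩具曲线 f(x) = (x − 2)² 上做梯度下降;曲线和起点都是为演示而选的,并非图 4-4 中的曲线。
This loop of checking the slope, stepping, and checking the slope again takes only a few lines of Python. The sketch below runs gradient descent on a hypothetical toy curve, f(x) = (x − 2)²; the curve and starting point are chosen for demonstration and are not the curve of Figure 4-4.

```python
def f(x):
    return (x - 2.0) ** 2          # toy error curve; its minimum is at x = 2

def slope(x, h=1e-6):
    """Numerical slope of f at x (central difference)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = -1.0                           # starting point on the curve
step_size = 0.1
for _ in range(100):
    x = x - step_size * slope(x)   # step opposite the slope

print(round(x, 4))                 # → 2.0, essentially at the minimum
```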
一维情形非常简单,因为每个点只有一个斜率,因此只有一个方向可走。然而,回想那片广阔开阔的草原就会知道,从任何一点出发都有无数个可能的方向,其中许多方向是有用的,因为它们能让我们同时向北和向下移动。其中一个方向,即最大梯度的方向,是最陡的,能让我们最快地朝目的地移动,这就是我们迈步的方向。重复这一过程,每次都使用最大梯度方向,就能在多个维度上完成我们在一维中所做的事情。准确地说,我们朝与最大梯度相反的方向迈步,因为最大梯度指向远离最小值的方向,而不是指向最小值。
The one-dimensional case is straightforward enough because at each point there is only one slope, so there is only one direction to go. However, recalling the vast, open grassland, we know that from any point there are an infinite number of directions we might head in, many of which are useful in that they move us northward and downward. One of these directions, the direction of the maximum gradient, is the steepest and moves us most quickly toward our desired destination, and that’s the direction we step in. Repeating the process, using the maximum gradient direction each time, accomplishes in multiple dimensions what we did in one dimension. To be precise, we step in the direction opposite the maximum gradient because the maximum gradient points away from the minimum, not toward it.
图 4-5展示了二维的梯度下降。该图是一张等值线图。想象一座分层开采的露天矿:阴影越浅,表示矿坑越深,但坡度也越平缓。也就是说,较浅的阴影意味着较平缓的坡度。
Figure 4-5 presents gradient descent in two dimensions. The figure shows a contour plot. Imagine an open pit mine with terraced levels: the lighter the shade, the deeper into the mine, but also the flatter the slope. That is, lighter shades imply shallower slopes.
图 4-5:二维梯度下降
Figure 4-5: Gradient descent in two dimensions
该图显示了梯度下降在三个起始位置(圆形、三角形和正方形)所采取的路径。最初,坡度很陡,因此步长很大,但随着接近最小值,坡度变浅,这意味着步长较小。最终,无论起点如何,梯度下降都会达到最小值。
The figure shows the path taken by gradient descent for three starting positions: the circle, the triangle, and the square. Initially, the slopes are steep, so the step sizes are big, but the slopes become shallow as the minimum is approached, implying smaller steps. Eventually, gradient descent reaches the minimum, regardless of the starting point.
我们讨论了一维和二维的梯度下降,因为这样可以把过程可视化。现在我们明白了,其实我们一直都懂这个算法:每当从高处走到低处时,我们自己就在使用它。说实话,训练神经网络所做的全部就是这些。初始的那组权重和偏差只不过是 n 维空间中的一个起点。梯度下降从该初始位置出发,沿最大梯度方向向某个最小值行进。n 维空间中的每个新位置,都是根据梯度的陡峭程度由前一组值生成的一组新的 n 个权重和偏差。当梯度变得非常小时,我们宣告胜利并固定权重和偏差,认为网络已经训练好了。
We’ve discussed gradient descent in one and two dimensions because we can visualize the process. We understand now that we have always known the algorithm and used it ourselves whenever we walk from a higher elevation to a lower one. Honestly, this is all that training a neural network does. The initial set of weights and biases is nothing more than a single starting point in an n-dimensional space. Gradient descent uses the maximum gradient from that initial starting position to march toward a minimum. Each new position in the n-dimensional space is a new set of the n weights and biases generated from the previous set based on the steepness of the gradient. When the gradient gets very small, we claim victory and fix the weights and biases, believing the network to be trained.
梯度下降依赖于斜率,也就是梯度的值。但梯度从哪里来呢?梯度下降最小化的是损失函数,即网络产生的误差。训练集上的误差是网络中每个权重和偏差值的函数。梯度表示每个权重和偏差对总体误差的贡献程度。
Gradient descent depends on slopes, on the value of the gradient. But where do the gradients come from? Gradient descent minimizes the loss function, or the error made by the network. The error over the training set is a function of each weight and bias value in the network. The gradient represents how much each weight and bias contributes to the overall error.
例如,假设我们知道权重 3(不管它标记的是哪个权重)对网络误差的贡献有多大,该误差由网络在训练集上犯的错误来衡量。在这种情况下,如果改变权重 3 的值而保持所有其他权重和偏差不变,我们就知道梯度的陡峭程度。该陡度乘以步长,就得到一个要从权重 3 当前值中减去的量。通过做减法,我们朝与最大梯度相反的方向移动。对网络中的每个权重和偏差重复这一计算,就相当于在 n 维空间中迈出了一步。这就是梯度下降在训练期间所做的事情。
For example, suppose we know how much weight 3 (whatever weight that labels) contributes to the network’s error as measured by the mistakes the network makes on the training set. In that case, we know the steepness of the gradient should we change weight 3’s value, keeping all other weights and biases the same. That steepness, multiplied by a step size, gives us a value to subtract from weight 3’s current value. By subtracting, we move in the direction opposite to the maximum gradient. Repeating the calculation for every weight and bias in the network takes a step in the n-dimensional space. This is what gradient descent does during training.
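实践中这些陡度值由反向传播高效算出,但其背后的想法(改动一个权重、保持其余不变、观察误差如何变化)可以用有限差分直接演示。下面是一个玩具草图:其中的 loss 函数是为演示虚构的,并不代表真实网络的误差。
In practice these steepness values come from backpropagation, but the underlying idea (nudge one weight while holding the others fixed and watch the error change) can be demonstrated directly with finite differences. Here is a toy sketch; the loss function is invented for the demonstration and does not represent a real network's error.

```python
import numpy as np

TARGET = np.array([0.5, -1.0, 2.0, 0.0])

def loss(weights):
    """Toy stand-in for the network's average error on the training set."""
    return float(np.sum((weights - TARGET) ** 2))

def gradient(weights, h=1e-6):
    """Per-weight steepness: change one weight, hold the rest fixed."""
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        up, down = weights.copy(), weights.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (loss(up) - loss(down)) / (2 * h)
    return grad

weights = np.zeros(4)
step_size = 0.1
for _ in range(200):
    weights -= step_size * gradient(weights)  # subtract: move against the gradient
# weights now sit at the minimum of the toy loss
```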
反向传播是为我们提供每个权重和偏差的陡度值的算法。它是微分学中一条著名法则的应用;微分学是数学的一个分支,告诉我们一个量如何随另一个量变化。速度就是一个例子:速度表示距离如何随时间变化,这甚至体现在我们谈论速度的方式中,如英里每小时或公里每小时。反向传播给出的“速度”表示网络误差如何随任何一个权重或偏差值的变化而变化。梯度下降将这些“速度”乘以一个称为学习率的比例因子,迈向由网络的 n 个权重和偏差构成的 n 维空间中的下一个位置。
Backpropagation is the algorithm that gives us the steepness values per weight and bias. Backpropagation is an application of a well-known rule from differential calculus, the branch of mathematics telling us how one thing changes as another changes. Speed is an example. Speed indicates how distance changes with time. It’s even in how we talk about speed: miles per hour or kilometers per hour. Backpropagation gives us the “speed” representing how the network’s error changes with a change in any weight or bias value. Gradient descent uses these “speeds,” multiplied by a scale factor known as the learning rate, to step to the next position in the n-dimensional space represented by the n weights and biases of the network.
例如,图 4-2中的“大”网络有 32 个权重和 9 个偏差;因此,使用梯度下降训练该网络意味着在 41 维空间中移动以找到 41 个权重和偏差值,从而使我们在训练集上获得最小的平均误差。
For example, the “big” network in Figure 4-2 has 32 weights and 9 biases; therefore, training that network with gradient descent means moving through a 41-dimensional space to find the 41 weight and bias values giving us the smallest error averaged over the training set.
该算法被称为“反向传播”,因为它计算每个权重和偏差的“速度”值,从网络的输出层开始,然后逐层向后移动到输入层。也就是说,它通过网络向后移动,将误差从一层传播到前一层。
The algorithm is called “backpropagation” because it calculates the “speed” values for each weight and bias, beginning with the network’s output layer and then moving backward, layer by layer, to the input layer. That is, it moves backward through the network to propagate the error from a layer to the previous layer.
核心要点是:
The take-home message is this:
梯度下降使用反向传播提供的梯度方向来迭代更新权重和偏差,以最小化网络在训练集上的误差。
Gradient descent uses the gradient direction supplied by backpropagation to iteratively update the weights and biases to minimize the network’s error over the training set.
简而言之,这就是神经网络的训练方式。
And that, in a nutshell, is how neural networks are trained.
能用反向传播和梯度下降训练神经网络,多少有点侥幸。它本不该奏效。带反向传播的梯度下降是一种一阶优化方法。一阶优化最适合简单的函数,而神经网络的误差面绝非简单。然而,命运女神福尔图娜对我们微笑了:它确实奏效,而且效果相当好。目前还没有严格的数学解释,只有这样一个认识:误差函数的各个局部最小值彼此大致相当,这意味着即使你落入其中一个而爬不出来,通常也没什么问题。
The ability to train a neural network with backpropagation and gradient descent is a bit of a fluke. It shouldn’t work. Gradient descent with backpropagation is a first-order optimization approach. First-order optimization works best with simple functions, and the error surfaces of a neural network are anything but. However, Fortuna has smiled upon us, and it does work, and rather well at that. There is as yet no rigorous mathematical explanation beyond the realization that the local minima of the error function are all pretty much the same, meaning if you land in one and can’t get out, that’s often just fine.
还有另一种经验解释,但要理解它,我们必须更多地了解训练过程。我在本章前面给出的六步训练算法讨论了通过网络运行训练集或其子集,并重复直到事情“足够好”。让我详细阐述这些步骤所隐含的过程。
There is another empirical explanation, but to understand that, we must learn more about the training process. The six-step training algorithm I gave earlier in the chapter talks about running the training set, or a subset of it, through the network, and repeating until things are “good enough.” Let me expand on the process implied by these steps.
训练数据每通过网络一次(一次前向传递加一次后向传递),就产生一个如图 4-5所示的梯度下降步骤。如果训练集很小,前向传递会用到全部训练数据,这意味着梯度下降用全部训练数据来决定下一步往哪里走。对训练数据的一次完整遍历称为一个 epoch(时期);因此,在前向和后向传递中使用全部训练数据,意味着每个 epoch 只进行一次梯度下降步骤。
Each pass of training data through the network, a forward pass followed by a backward pass, results in a gradient descent step as shown in Figure 4-5. If the training set is small, all of it is used in the forward pass, meaning all of it is used by gradient descent to decide where to step next. A complete pass through the training data is called an epoch; therefore, using all the training data in the forward and backward passes results in one gradient descent step per epoch.
现代机器学习数据集通常非常庞大,这使得在每个梯度下降步骤中使用全部训练数据在计算上不可行。取而代之的是,从数据中随机选取一个小子集(称为小批量,minibatch)进行前向和后向传递。使用小批量极大地减少了梯度下降期间的计算开销,从而使每个 epoch 包含许多步骤。小批量还提供了另一个好处,有助于克服“这种训练方法本不该奏效”的问题。
Modern machine learning datasets are often massive, making it computationally infeasible to use all of the training data for each gradient descent step. Instead, a small, randomly selected subset of the data, known as a minibatch, is passed through the network for the forward and backward passes. Using minibatches dramatically reduces the computational overhead during gradient descent, resulting in many steps per epoch. Minibatches also provide another benefit that helps overcome the “this approach to training shouldn’t work” issue.
假设我们有一个表示网络误差的数学函数。那样的话,我们就可以用几百年历史的微积分技术,求出每个权重和偏差对误差贡献的精确形式;梯度下降每次都会知道最佳的迈步方向。不幸的是,世界没有这么仁慈。我们不知道误差函数的数学形式(很可能根本不存在一个可知的形式),所以必须用训练数据来近似。使用的训练数据越多,这种近似就越好。这一事实支持在每个梯度下降步骤中使用全部训练数据。然而,我们已经知道,这在很多情况下计算负担极重。
Suppose we had a mathematical function representing the error made by the network. In that case, we could use centuries-old calculus techniques to find the exact form of each weight and bias’s contribution to the error; gradient descent would know the best direction to step each time. Unfortunately, the world isn’t that kind. We don’t know the mathematical form of the error function (there isn’t likely one to know), so we have to approximate with our training data. This approximation improves when using more training data to determine the error. This fact argues for using all the training data for each gradient descent step. However, we already know this is computationally extremely taxing in many cases.
折衷方案是在每个梯度下降步骤中使用小批量。计算不再过于繁重,但对真实梯度的近似变差了,因为我们用更少的数据点来估计它。随机选取常与英文单词“stochastic”(随机的)联系在一起,因此使用小批量训练被称为随机梯度下降。某种形式的随机梯度下降是几乎所有现代人工智能使用的标准训练方法。
The compromise is to use minibatches for each gradient descent step. The calculations are no longer too taxing, but the approximation of the actual gradient is worse because we are estimating it with fewer data points. Randomly selecting something is often attached to the word “stochastic,” so training with minibatches is known as stochastic gradient descent. Stochastic gradient descent, in one form or another, is the standard training approach used by virtually all modern AI.
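每个 epoch 如何被切分成小批量,可以这样勾勒(批大小 32 只是演示用的假设值):
The way an epoch is carved into minibatches can be sketched as follows (the batch size of 32 is an assumed value for demonstration):

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield one epoch's worth of randomly chosen minibatch index arrays."""
    order = rng.permutation(n_samples)   # shuffle the dataset once per epoch
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(n_samples=1336, batch_size=32, rng=rng))
# 1,336 samples in batches of 32 gives 42 gradient descent steps per epoch
# (the final batch holds the 24 leftover samples)
print(len(batches), len(batches[-1]))    # → 42 24
```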
乍一看,随机梯度下降听起来像是一个失败的提议。当然,我们可以在宇宙热寂之前计算许多梯度下降步骤,但是我们的梯度保真度很低,而且我们很可能在误差空间中朝着错误的方向移动。这不太好,不是吗?
At first blush, stochastic gradient descent sounds like a losing proposition. Sure, we can calculate many gradient descent steps before the heat death of the universe, but our gradient fidelity is low, and we’re likely moving in the wrong direction through the error space. That can’t be good, can it?
这是福尔图娜第二次对人类微笑的地方。她不仅使我们能够通过一阶梯度下降训练复杂模型,因为局部最小值(假设)大致相同;她还进行了安排,以便随机梯度下降找到的“错误”梯度方向通常是我们在训练过程早期避免局部最小值所需要的。换句话说,当我们应该向正北走时,稍微向东北走是因祸得福,让我们能够训练大型神经网络。
Here’s where Fortuna smiles on humanity a second time. Not only has she given us the ability to train complex models with first-order gradient descent because local minima are (assumed) roughly equivalent; she’s also arranged things so that the “wrong” gradient direction found by stochastic gradient descent is often what we need to avoid local minima early in the training process. In other words, walking slightly northeast when we should head due north is a blessing in disguise that allows us to train large neural networks.
我们已准备好进入下一章。然而,在此之前,让我们将传统的神经网络应用于恐龙足迹数据集。我们将把结果与第 3 章的经典模型进行比较。
We’re ready to move on to the next chapter. However, before we do, let’s apply traditional neural networks to the dinosaur footprint dataset. We’ll compare the results to the classical models of Chapter 3.
我们首先需要选择一个架构:即隐藏层的数量、每层的节点数量,以及每个节点的激活函数类型。恐龙足迹数据集有两类:鸟臀目(第 0 类)和兽脚亚目(第 1 类)。因此,输出节点应使用 sigmoid 激活函数,给出属于第 1 类的可能性。网络的输出值估计输入图像代表兽脚亚目恐龙的概率。如果概率高于 50%,我们就把输入归入第 1 类;否则归入第 0 类。我们将继续对隐藏层节点使用修正线性单元激活,正如本章所有模型那样。剩下的就是选择隐藏层的数量和每层的节点数量。
We need first to select an architecture: that is, the number of hidden layers, the number of nodes per layer, and the type of activation function for each node. The dinosaur footprint dataset has two classes: ornithischian (class 0) and theropod (class 1). Therefore, the output node should use a sigmoid activation function to give us a likelihood of class 1 membership. The network’s output value estimates the probability that the input image represents a theropod. If the probability is above 50 percent, we’ll assign the input to class 1; otherwise, into class 0 it goes. We’ll stick with rectified linear unit activations for the hidden layer nodes, as we have for all the models in this chapter. All that remains is to select the number of hidden layers and the number of nodes per layer.
足迹数据集中有 1,336 个训练样本。这不算多,而且我们没有对数据集做增强,因此需要一个偏小的模型。大模型(即许多节点和层)需要大的训练集;否则,相对于训练样本的数量,要学习的权重和偏差就太多了。因此,对于足迹数据集,我们将只尝试最多两个隐藏层的模型。至于隐藏层中的节点数量,我们将让第一个隐藏层的节点数从非常小变化到接近输入大小(1,600 个特征,即展开的 40×40 像素图像)的两倍。如果尝试第二个隐藏层,我们会把它的节点数量限制为不超过第一个隐藏层的一半。
There are 1,336 training samples in the footprints dataset. That’s not a lot, and we aren’t augmenting the dataset, so we need a smallish model. Large models, meaning many nodes and layers, require large training sets; otherwise, there are too many weights and biases to learn relative to the number of training samples. Therefore, we’ll limit ourselves to trying at most two hidden layer models for the footprints dataset. As for the number of nodes in the hidden layers, we’ll let the first hidden layer vary from very small to nearly twice the input size of 1,600 features (the 40×40-pixel image unraveled). If we try a second hidden layer, we’ll restrict the number of nodes to no more than half the number in the first hidden layer.
首先,我们将训练一组一层和两层的架构。然后,我们将把其中表现最好的架构训练 100 次,以得到其平均性能水平。表 4-1列出了这些试验模型的结果。
First, we'll train a collection of one- and two-layer architectures. Second, we'll train the best-performing of those architectures 100 times to give us an average level of performance. Table 4-1 presents the trial models' results.
表 4-1:恐龙足迹数据集的试验架构
Table 4-1: Trial Architectures with the Dinosaur Footprint Dataset
| 准确性 (%) Accuracy (%) | 架构 Architecture | 权重和偏差 Weights and biases |
|---|---|---|
| 59.4 | 10 | 16,021 |
| 77.0 | 400 | 640,801 |
| 76.7 | 800 | 1,281,601 |
| **81.2** | **2,400** | **3,844,801** |
| 75.8 | 100, 50 | 165,201 |
| **81.2** | **800, 100** | **1,361,001** |
| 77.9 | 2,400, 800 | 5,764,001 |
隐藏层只有 10 个节点的网络表现最差,准确率只有 60% 左右。一个只靠抛硬币做判断的二元分类器约有 50% 的正确率,所以这个 10 节点网络的表现仅略高于随机水平。我们不想要它。其余大多数网络的准确率在 75% 到 79% 之间。
The network with a mere 10 nodes in its hidden layer was the worst, returning an accuracy of about 60 percent. A binary classifier that does nothing but flip a coin is correct about 50 percent of the time, so the 10-node network is performing only slightly above chance. We don't want that one. Most of the other networks return accuracies in the mid- to upper 70s.
加粗的两个模型的准确率都略高于 81%。第一个使用包含 2,400 个节点的单个隐藏层;第二个使用 800 个节点的隐藏层,后接一个 100 个节点的隐藏层。两个模型在测试集上的准确率相同,但 2,400 节点模型的权重和偏差数量几乎是两层模型的三倍,所以我们选用两层模型。(请记住,表 4-1中的结果来自单次训练,而不是多次训练的平均值。我们很快会解决这一点。)
The two models in bold each produced just over 81 percent accuracy. The first used a single hidden layer of 2,400 nodes. The second used a hidden layer of 800 nodes, followed by another with 100 nodes. Both models produced the same accuracy on the test set, but the 2,400-node model had nearly three times as many weights and biases as the two-layer model, so we’ll go with the two-layer model. (Bear in mind that the results in Table 4-1 represent a single training session, not the average of many. We’ll fix that shortly.)
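作为示意,这样的分类器用 scikit-learn 几行代码就能搭出来。下面的草图仅为演示:数据是随机生成的占位数据,并非真实的足迹图像,模型设置也只是近似正文的描述。
As an illustration, a classifier like this can be set up in a few lines with scikit-learn. The sketch below is for demonstration only: the data is randomly generated placeholder data, not real footprint images, and the settings only approximate the description in the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 40 fake "images" of 1,600 features each (a real run
# would use the 1,336 unraveled 40x40 footprint images).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1600))
y = rng.integers(0, 2, size=40)      # 0 = ornithischian, 1 = theropod

# Two ReLU hidden layers of 800 and 100 nodes; scikit-learn adds a
# logistic (sigmoid) output for binary problems automatically.
model = MLPClassifier(hidden_layer_sizes=(800, 100), activation="relu",
                      max_iter=10, random_state=0)
model.fit(X, y)

prob_theropod = model.predict_proba(X)[:, 1]  # likelihood of class 1
pred = (prob_theropod > 0.5).astype(int)      # 50 percent threshold
```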
两层模型还是比较大的。我们正在尝试学习 140 万个参数来调节模型,以正确分类恐龙足迹图像。有很多参数需要学习,尤其是训练集只有 1,336 个样本。全连接神经网络所需参数的数量快速增长。我们将在第五章讨论卷积神经网络时重新审视这一观察结果。
The two-layer model is still relatively large. We’re trying to learn 1.4 million parameters to condition the model to correctly classify the dinosaur footprint images. That’s a lot of parameters to learn, especially with a training set of only 1,336 samples. Fully connected neural networks grow quickly in terms of the number of parameters required. We’ll revisit this observation in Chapter 5 when discussing convolutional neural networks.
我们确定了架构:两个使用修正线性激活函数的隐藏层,分别有 800 和 100 个节点,后接一个使用 sigmoid 的单节点,给出属于第 1 类的可能性。在足迹数据集上将该模型训练 100 次,平均准确率为 77.4%,最低 69.3%,最高 81.5%。让我们把这个结果与第 3 章的结果放在一起比较;见表 4-2。
We have our architecture: two hidden layers using rectified linear activation functions with 800 and 100 nodes, respectively, followed by a single node using a sigmoid to give us a likelihood of class 1 membership. Training the model 100 times on the footprints dataset returned an average accuracy of 77.4 percent, with a minimum of 69.3 percent and a maximum of 81.5 percent. Let’s put this result in its proper relation to those of Chapter 3; see Table 4-2.
表 4-2:恐龙足迹模型
Table 4-2: Dinosaur Footprint Models
| 模型 Model | 准确性 (%) Accuracy (%) |
|---|---|
| RF300 | 83.3 |
| RBF 支持向量机 RBF SVM | 82.4 |
| 7-NN | 80.0 |
| 3-NN | 77.6 |
| 多层感知器 MLP | 77.4 |
| 1-NN | 76.1 |
| 线性支持向量机 Linear SVM | 70.7 |
回想一下,RF300 表示有 300 棵树的随机森林,SVM 指支持向量机,而 NN(有点令人困惑)指最近邻分类器。我用 MLP(多层感知器)来指代我们的神经网络。多层感知器是本章所讨论的传统神经网络的一个古老但至今仍常见的名称;请注意它与 Rosenblatt 20 世纪 50 年代末的原始感知机(Perceptron)的渊源。
Recall that RF300 means a random forest with 300 trees, SVM refers to a support vector machine, and, somewhat confusingly, NN refers to a nearest neighbor classifier. I’m using MLP (multilayer perceptron) as a stand-in for our neural network. Multilayer perceptron is an old but still common name for the traditional neural networks we’ve been discussing in this chapter—notice the link back to Rosenblatt’s original Perceptron from the late 1950s.
我们的神经网络并不是这个数据集上表现最好的模型。事实上,它是最差的之一。再做些调整也许能让它在榜单上上升一两位,但以我的经验,这种性能水平很典型,并且在深度学习革命之前助长了一种普遍“感知”(一语双关):神经网络是平平无奇的模型,没什么可称道的。
Our neural network wasn’t the best performer on this dataset. In fact, it was one of the worst. Additional tweaking might move it up a place or two on the list, but this level of performance is typical, in my experience, and contributed to the general perception (pun intended) before the deep learning revolution that neural networks are “meh” models—run-of-the-mill, nothing to write home about.
本章介绍了现代神经网络背后的基本思想。本书的其余部分以本章介绍的基本概念为基础。以下是主要内容:
This chapter introduced the fundamental ideas behind modern neural networks. The remainder of the book builds on the basic concepts covered in this chapter. Here are the principal takeaways:
现在,让我们继续研究卷积神经网络,这种架构引发了深度学习革命。本章把我们带到了 2000 年代初期。接下来我们将进入 2012 年及以后。
Now, let’s press on to investigate convolutional neural networks, the architecture that ushered in the deep learning revolution. This chapter brought us to the early 2000s. The next moves us to 2012 and beyond.
关键术语
KEY TERMS
激活函数、架构、后向传递、偏差、数据增强、epoch(时期)、前向传递、全局最小值、梯度下降、隐藏层、学习率、局部最小值、损失、小批量、多层感知器、神经元、节点、过拟合、预处理、修正线性单元、正则化器、sigmoid、随机梯度下降、权重
activation function, architecture, backward pass, bias, data augmentation, epoch, forward pass, global minimum, gradient descent, hidden layer, learning rate, local minimum, loss, minibatch, multilayer perceptron, neuron, node, overfitting, preprocessing, rectified linear unit, regularizer, sigmoid, stochastic gradient descent, weight
经典的机器学习模型面临着适当的特征选择、特征向量维度以及无法从输入固有的结构中学习的问题。卷积神经网络 (CNN)通过学习生成输入的新表示并同时对它们进行分类(这一过程称为端到端学习)来克服这些问题。 CNN 是我在第 2 章中提到的表示学习数据处理器。
Classical machine learning models struggle with appropriate feature selection, feature vector dimensionality, and the inability to learn from the structure inherent in the input. Convolutional neural networks (CNNs) overcome these issues by learning to generate new representations of their inputs while simultaneously classifying them, a process known as end-to-end learning. CNNs are the representation-learning data processors I referred to in Chapter 2.
后来构成 CNN 的各种元素,从 Rosenblatt 的感知机开始,在神经网络历史的不同时期陆续出现,但引发深度学习革命的那个架构发表于 1998 年。还需要计算能力再改进十多年,CNN 的全部威力才随着 2012 年 AlexNet 的出现而得以释放。
Elements of what became CNNs appeared at various times throughout the history of neural networks, beginning with Rosenblatt’s Perceptron, but the architecture that ushered in the deep learning revolution was published in 1998. Over a decade of additional improvements in computing capability were required to unleash the full power of CNNs with the appearance of AlexNet in 2012.
卷积网络利用其输入中的结构。随着本章推进,我们会更好地理解这意味着什么。在一维情形下,输入可能是随时间变化的值,也称为时间序列。在二维情形下,我们谈论的是图像。三维 CNN 则用于解读体数据,例如一叠磁共振图像,或由 LiDAR 点云构建的体。在本章中,我们将只关注二维 CNN。
Convolutional networks exploit structure in their inputs. We’ll better understand what that means as the chapter progresses. In one dimension, the inputs might be values that change over time, also known as a time series. In two dimensions, we’re talking about images. Three-dimensional CNNs exist to interpret volumes of data, like a stack of magnetic resonance images or a volume constructed from a LiDAR point cloud. In this chapter, we’ll focus exclusively on two-dimensional CNNs.
向传统神经网络呈现特征的顺序是无关紧要的。无论我们把特征向量以 (x0, x1, x2) 还是 (x2, x0, x1) 的形式提供给模型,模型都能学得一样好,因为它假设这些特征相互独立、彼此无关。事实上,像素值与相邻像素值之间的强相关性正是传统机器学习模型所不想要的;它们无法在这类输入上取得多大成功,这让神经网络停滞了多年。
The order in which features are presented to a traditional neural network is irrelevant. Regardless of whether we present feature vectors to the model as (x0,x1,x2) or (x2,x0,x1), the model will learn just as well because it assumes the features are independent and unrelated to each other. Indeed, a strong correlation between a pixel value and adjacent pixel values is something traditional machine learning models do not want, and their inability to achieve much success with such inputs held neural networks back for years.
Convolutional neural networks, on the other hand, exploit structure in their inputs. For a CNN, it matters whether we present the input as (x0,x1,x2) or (x2,x0,x1); the model might learn well with the former and poorly with the latter. This isn’t a weakness, but a strength, because we want to apply CNNs to situations where there is structure to learn—structure that helps determine how best to classify inputs.
Later in the chapter, we’ll compare the performance of a traditional neural network to a CNN when classifying small photos of animals and vehicles (the CIFAR-10 dataset of Chapter 3). At that time, we’ll learn the true power of exploiting structure. Before that, however, let’s conduct a little experiment. We have two datasets. The first is our old friend, the MNIST digits dataset; the second is the same collection of digit images, but the order of the pixels in the images has been scrambled. The scrambling isn’t random but consistent so that the pixel at position (1,12) has been moved to, say, position (26,13), with similarly consistent moves for all other pixels. Figure 5-1 shows some examples of MNIST digits and scrambled versions of the same digits.
Figure 5-1: Example MNIST digits (top) and scrambled versions of the same digits (bottom)
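The consistent scramble shown in Figure 5-1 can be sketched in a few lines of NumPy. The seed and the 28×28 image size here are illustrative assumptions; the point is that one permutation, chosen once, is reused for every image:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed: the same shuffle every run

# One permutation of all 28x28 = 784 pixel positions, chosen once and
# reused for every image, so the scramble is consistent rather than random.
permutation = rng.permutation(28 * 28)

def scramble(image):
    """Apply the same fixed pixel shuffle to a 28x28 image."""
    return image.reshape(-1)[permutation].reshape(28, 28)

# A stand-in "digit": scrambling moves pixels but preserves their values.
digit = np.arange(28 * 28).reshape(28, 28)
scrambled = scramble(digit)
```

Because the permutation is fixed, a model trained on scrambled images must also be tested on scrambled images, exactly as described above.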
The scrambled digits are incomprehensible to me. The pixel information between the original and scrambled digits is the same—that is, the same collection of pixel values is present in both—but the structure is largely gone, and I can no longer discern the digits. I claim that a traditional neural network treats its inputs holistically and isn’t looking for structure. If that’s the case, a traditional neural network shouldn’t care that the digits have been scrambled; it should learn just as well when trained using the original or the scrambled dataset. As it turns out, that’s precisely what happens. The model learns equally well; scrambling changes nothing in terms of performance. Note, though, that the scrambled test digits must be used with the scrambled model; we shouldn’t expect the model to work when trained on one dataset and tested on the other.
We at present know only one fact about CNNs: they pay attention to structure in their inputs. Knowing this, should we expect a CNN trained on the scrambled dataset to perform as well as one trained on the original dataset? The scrambled digits are uninterpretable by us because local structure in the images has been destroyed. Therefore, we might expect a model that similarly wants to exploit local structure to be unable to interpret the scrambled digits. And that is the case: a CNN trained on the scrambled dataset performs poorly compared to one trained on the original dataset.
Why can’t we easily interpret the scrambled digits? We must explore what happens in the brain during vision to answer that question. Then we’ll circle back to relate that process to what CNNs do. As we’ll learn, CNNs follow the old adage: when in Rome, do as the Romans (humans) do.
****
Vincent van Gogh is my favorite artist. Something about his style speaks to me, something strangely peaceful from a man tormented by mental illness. I believe the peace emanating from his work reflects his attempt to calm the turmoil within.
Consider Figure 5-2. It shows Van Gogh’s famous 1889 painting of his bedroom in Arles. The image is in black and white, an unforgivable violence to Vincent’s use of color, but print restrictions require it.
Figure 5-2: Van Gogh’s bedroom in Arles, 1889 (public domain)
What do you see in the painting? I’m not asking about a higher meaning or impression, but objectively, what do you see in the painting? I see a bed, two chairs, a small table, a window, and a pitcher on the table, among many other items. I suspect you see the same. You saw the bed, two chairs, and table, but how? Photons, particles of light, traveled from the image to your eye and were converted into discrete objects in your brain. Again, how?
I’m asking questions but not yet offering answers. That’s okay for two reasons. First, pondering the problem of segmenting an image into a collection of meaningful objects is worth some effort on our part. Second, no one yet knows the full answer to “how?” Neuroscientists do, however, understand the beginnings of the process.
We take for granted the ability to look at a scene and parse it into separate and identified objects. For us, the process is effortless, completely automatic. We shouldn’t be fooled. We’re the beneficiaries of hundreds of millions of years of evolution’s tinkering. For mammals, vision begins in the eye, but parsing and understanding begins in the primary visual cortex at the back of our brains.
The primary visual cortex, known as area V1, is sensitive to edges and orientation. Immediately, we encounter a clue to how vision works in the brain (as opposed to the eye). The brain takes the input sensations, spread over V1 as a warped image, and begins by seeking edges and the orientation of the edges. V1 is additionally sensitive to color. Mapping the entire visual field over V1, with magnification so that most of V1 is occupied by the central 2 percent of our visual field, means that edge detection, orientation, and color are local to where they occur.
V1 sends its detections to area V2, which sends its detections to area V3, and so on through V4 to V5, with each area receiving, essentially, a representation of larger and more grouped elements of what is in the visual field. The process starts with V1 and, eventually, delivers a fully parsed and understood representation of what the eyes see. As mentioned, the details much beyond V1 are murky, but for our purposes all we need to remember is that V1 is sensitive to edges, the orientation of edges, and colors (we might also include textures). Starting simply and grouping to separate objects in the scene is the name of the game. CNNs mimic this process. It’s fair to say that CNNs literally learn to see the world of their inputs.
CNNs decompose inputs into small parts, then groups of parts and still larger groups of groups of parts, until the entire input is transformed from a single whole into a new representation: one that is more easily understood by what amounts to a traditional neural network sitting at the top of the model. However, mapping the input to a new, more easily understood representation does not imply that the new representation is more easily understood by us.
Convolutional neural networks learn during training to partition inputs into parts, enabling the top layers of the network to classify successfully. In other words, CNNs learn new representations of their inputs and then classify those new representations. Indeed, “Learning New Representations from Old” was an early title for this chapter.
How do CNNs break their inputs into parts? To answer that question, we must first understand the “convolution” part of “convolutional neural network.” Be warned, low-level details ahead.
****
Convolution is a mathematical operation with a formal definition involving integral calculus. Fortunately for us, convolution is a straightforward operation in digital images, using nothing more than multiplication and addition. Convolution slides a small square, known as a kernel, over the image from top to bottom and left to right. At each position, convolution multiplies the pixel values covered by the square with the corresponding kernel values. It then sums all those products to produce a single number that becomes the output pixel value for that position. Words only go so far here, so let’s try a picture. Consider Figure 5-3.
Figure 5-3: Convolving a kernel over an image
The left side of Figure 5-3 shows a grid of numbers. These are the pixel values for the center portion of the image in Figure 5-4. Grayscale pixel values are typically in the range 0 through 255, where lower values are darker. The kernel is the 3×3 grid to the right. The convolution operation instructs us to multiply each pixel value by the corresponding kernel value. This produces the rightmost 3×3 grid of numbers. The final step sums all nine values to create a single output, 48, which replaces the center pixel in the output image, 60 → 48.
To complete the convolution, slide the 3×3 solid box one pixel to the right and repeat. When the end of a row is reached, move the box down one pixel and repeat for the next row, processing row by row until the kernel has covered the entire image. The convolved image is the collection of new output pixels.
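Here is a minimal sketch of that slide-multiply-sum procedure in NumPy. Like deep learning libraries, it skips the kernel flip found in the formal mathematical definition (strictly speaking, it computes cross-correlation), and the image and kernel values are made up for illustration:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' convolution: slide the kernel over the image, multiply the
    overlapping values, and sum each window into one output pixel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out

# Made-up pixel values and kernel, purely for illustration.
image = np.array([
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
], dtype=float)
kernel = np.array([
    [0,  1, 0],
    [1, -4, 1],
    [0,  1, 0],
], dtype=float)

result = convolve2d(image, kernel)  # a 4x4 input yields a 2x2 output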
At first, convolution might seem like a strange thing to do. However, in digital images, convolution is a fundamental operation. An appropriately defined kernel lets us filter an image to enhance it in various ways. For example, Figure 5-4 shows four images. The upper left is the original image, a frequently used test image of Gold Hill in Shaftesbury, England. The remaining three images are filtered versions of the original. Clockwise from the upper right, we have a blurred version, one showing horizontal edges, and one showing vertical edges. Each image is produced by convolving a kernel as described previously. The kernel of Figure 5-3 produces the horizontal-edge image at the lower right. Rotate the kernel by 90 degrees, and you get the vertical-edge image at the lower left. Finally, make all the kernel values 1, and you get the blurred image at the upper right. Note that the edge images are inverted to make the detected edges black instead of white.
Figure 5-4: Convolution kernels in action
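The three filters just described can be written down directly. The edge-kernel values below (a Sobel filter) are illustrative assumptions, since the exact kernel of Figure 5-3 isn’t reproduced here, but the relationships among the three match the text:

```python
import numpy as np

# A common horizontal-edge kernel (the Sobel filter); the exact values
# used in the figures aren't reproduced here, so these are illustrative.
horizontal_edges = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
], dtype=float)

# Rotating the kernel 90 degrees turns the horizontal-edge detector
# into a vertical-edge detector, as the text describes.
vertical_edges = np.rot90(horizontal_edges)

# Making every kernel value the same constant averages each 3x3
# neighborhood, which blurs the image; dividing by 9 keeps the overall
# brightness unchanged.
blur = np.ones((3, 3)) / 9.0
```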
The critical point for us to remember is that convolving an image with different kernels highlights different aspects of the image. It isn’t hard to imagine an appropriate set of kernels extracting structure relevant to correctly classifying the image. This is exactly what CNNs do during end-to-end training and, in a sense, what our visual system does in area V1 when it detects edges, orientations, colors, and textures.
We’re making progress. We now have a handle on the core operation of a CNN, convolution, so let’s take the next step to learn how convolution is used within a model to extract structure and build a new representation of the input.
****
The traditional neural networks of Chapter 4 consist of a single kind of layer: a collection of fully connected nodes accepting input from the layer below to produce output for the layer above. Convolutional neural networks are more flexible and support diverse layer types. Regardless, the data flow is the same: from input to layer after layer to the network’s output.
In CNN parlance, the fully connected layers a traditional neural network uses are called dense layers. CNNs usually use dense layers at the top, near the output, because by that time the network has transformed the input into a new representation, one that the fully connected layers can classify successfully. CNNs make heavy use of convolutional layers and pooling layers.
Convolutional layers apply a collection of kernels to their input to produce multiple outputs, much as Figure 5-4 produced three outputs from the one input image at the upper left. The kernels are learned during training using the same backpropagation and gradient descent approach we encountered in Chapter 4. The values of the learned kernels are the weights of the convolutional layer.
Pooling layers have no weights associated with them. There’s nothing to learn. Rather, pooling layers perform a fixed operation on their inputs: they reduce the spatial extent of their inputs by sliding a 2×2 window across and then down, without overlap, and keeping only the largest value in each window. The net effect is similar to reducing the size of an image by a factor of two. Figure 5-5 illustrates the process of changing an 8×8 input into a 4×4 output, keeping the maximum value in each solid square. Pooling layers are a concession to reduce the number of parameters in the network.
Figure 5-5: Pooling to reduce the spatial extent of the data
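The pooling operation of Figure 5-5 can be sketched in NumPy, here with a smaller 4×4 input and made-up values:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling: keep the largest value in each
    2x2 block, halving the height and width."""
    h, w = x.shape
    # Group the pixels into 2x2 blocks, then take the max within each block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])
pooled = max_pool_2x2(x)  # a 4x4 input becomes a 2x2 output
```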
A typical CNN combines convolutional and pooling layers before topping things off with a dense layer or two. ReLU layers are used as well, usually after the convolutional and dense layers. For example, a classic CNN architecture known as LeNet consists of the following layers:

Convolutional (6) → ReLU → Pooling → Convolutional (16) → ReLU → Pooling → Convolutional (120) → ReLU → Dense (84) → ReLU → Output
The model uses three convolutional layers, two pooling layers, and a single dense layer with 84 nodes. Each convolutional and dense layer is followed by a ReLU layer to map all negative inputs to zero while leaving all positive inputs untouched.
The number in parentheses for each convolutional layer is the number of filters to learn in that layer. A filter is a collection of convolutional kernels, with one kernel for each input channel. For example, the first convolutional layer learns six filters. The input is a grayscale image with one channel, so this layer learns six kernels. The second convolutional layer learns 16 filters, each with 6 kernels, one for each of the 6 input channels from the first convolutional layer. Therefore, the second convolutional layer learns a total of 96 kernels. Finally, the last convolutional layer learns 120 filters, each with 16 kernels, for another 1,920 kernels. All told, the LeNet model needs to learn 2,022 different convolutional kernels.
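The kernel bookkeeping above reduces to a small calculation:

```python
# Kernel bookkeeping for the LeNet-style model described in the text:
# each convolutional layer learns filters x input_channels kernels.
layers = [
    (6, 1),     # 6 filters, 1 grayscale input channel  ->     6 kernels
    (16, 6),    # 16 filters, 6 input channels          ->    96 kernels
    (120, 16),  # 120 filters, 16 input channels        -> 1,920 kernels
]
kernels_per_layer = [filters * channels for filters, channels in layers]
total_kernels = sum(kernels_per_layer)  # 6 + 96 + 1,920 = 2,022
```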
The hope is that learning so many kernels will produce a sequence of outputs that capture essential elements of the structures in the input. If training is successful, the output of the final convolutional layer, as a vector input to the dense layer, will contain values that clearly differentiate between classes—at least, more clearly than can be accomplished by using the image alone.
If it feels like we’re in the weeds, we are, but we will not dig further. We’ve reached the lowest level of detail we’ll consider in the book, in fact, but it’s a necessary burden, as we cannot understand how CNNs work if we don’t understand convolution and convolutional layers.
Perhaps the best way to understand what the layers of a CNN are doing is to look at their effect on data flowing through the network. Figure 5-6 shows how a LeNet model trained on MNIST digits manipulates two input images. The output of the first convolutional layer is the six middle images, where gray represents zero, darker pixels are increasingly negative, and lighter pixels are increasingly positive. The six kernels of the first convolutional layer each produce an output image for the single input image. The kernels highlight different portions of the inputs as transitions from dark to light.
Figure 5-6: Input to first convolutional layer to dense layer
The rightmost barcode-like pattern is a representation of the dense layer’s output. We’re ignoring the output of the second and third convolutional layers and jumping directly to the end of the model. The dense layer’s output is a vector of 84 numbers. For Figure 5-6, I mapped these numbers to pixel values, where larger values correspond to darker vertical bars.
Notice that the barcodes for the digits 0 and 8 differ. If the model learned well, we might expect the barcodes for the dense layer outputs to share commonalities across digits. In other words, the barcodes for zeros should look roughly similar, as should the barcodes for eights. Do they? Consider Figure 5-7.
Figure 5-7: Dense layer output for sample inputs
This figure presents the dense layer outputs for five different zero and eight inputs. The barcodes are all different but share similarities according to digit. This is especially true for the zeros. The LeNet model has learned how to map each 28×28-pixel input image (784 pixels) into a vector of 84 numbers that show strong similarities by digit type. Based on our experience with traditional neural networks, we can appreciate that this mapping has produced something of lower dimensionality that preserves and even emphasizes differences between digits. The learned lower-dimensionality vector is akin to a complex concept explained with a few well-chosen words. This is exactly what we want a CNN to do. The trained model learned to “see” in the world of handwritten digits represented as small grayscale images. There’s nothing special about grayscale images, either. CNNs are quite happy to work with color images represented by red, green, and blue channels, or any number of channels, as when using multiband satellite imagery.
We might think of the model this way: the CNN layers before the dense layer learned how to act as a function producing an output vector from the input image. The true classifier is the dense layer at the top, but it works well because the CNN learned the classifier (dense layer) while simultaneously learning the mapping function.
I stated earlier that higher layers in the CNN pay attention to ever larger parts of the input. We can see that this is so by considering the portion of the input that influences the output of a kernel at a deeper layer. Figure 5-8 demonstrates this effect.
Figure 5-8: The part of the input affecting deeper layers of the model
Begin on the right side of the image. The 3×3 grid of squares represents the output of a kernel at convolutional layer 1. We want to know what portion of the input influences the value of the shaded pixel. Looking at the previous convolutional layer, layer 0, we see that the layer 1 output depends on the nine shaded values coming from the layer before.
The nine shaded values of convolutional layer 0 depend on the 5×5 shaded region of the input. It’s 5×5 because each of the nine values is found by sliding a 3×3 kernel over the shaded 5×5 region of the input. For example, the dotted portion of the middle value in layer 0 comes from the similarly shaded 3×3 region of the input. In this way, higher CNN layers are affected by larger and larger portions of the input. The technical term for this is the effective receptive field, where the effective receptive field of the rightmost shaded value in Figure 5-8 is the 5×5 shaded region of the input.
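For a stack of stride-1 convolutional layers with 3×3 kernels and no pooling (the situation in Figure 5-8), the growth of the effective receptive field follows a simple formula, sketched here:

```python
def effective_receptive_field(num_layers, kernel_size=3):
    """Width of the input patch that influences one output value after
    stacking stride-1 convolutional layers with square kernels. Each
    layer grows the patch by kernel_size - 1 pixels on each axis."""
    return 1 + num_layers * (kernel_size - 1)

# One 3x3 layer sees a 3x3 patch of the input; two stacked layers see
# the 5x5 patch shown in Figure 5-8; deeper layers see ever more.
fields = [effective_receptive_field(n) for n in (1, 2, 3, 4)]
```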
****
It’s time for an experiment. We now have a handle on how CNNs function, so let’s put that knowledge to work to compare a traditional neural network with a convolutional model. Which will win? I suspect you already know the answer, but let’s prove it and gain some experience along the way.
We need a dataset. Let’s use a grayscale version of CIFAR-10. This is a better choice than the dinosaur footprint dataset we used in the previous two chapters because the footprint images are outlines devoid of texture and background, and a CNN will not learn much more from such images than a traditional model. As we learned in Chapter 3, CIFAR-10 contains 32×32-pixel images of animals and vehicles, which will likely be more challenging.
We’ll train three models: a random forest, a traditional neural network, and a convolutional neural network. Is this sufficient? We’ve come to appreciate that all three of these models involve randomness, so training once might not give us a fair representation of how each model performs. After all, we might get a lousy initialization or mix of trees that would throw one of the models off. Therefore, let’s train each model 10 times and average the results.
This experiment will help us understand the differences in performance between the models, but we can learn still more about the neural networks by tracking their errors as training progresses. The result is a graph, which I’ll present and then explain shortly. Before that, however, let me lay out the details of the models.
The training and test datasets are the same for each model. The traditional neural network and the random forest require vector inputs, so each 32×32-pixel image is unraveled into a vector of 1,024 numbers. The CNN works with the actual two-dimensional images. There are 50,000 images in the training set, 5,000 for each of the 10 classes, and 10,000 images in the test set, 1,000 per class.
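That unraveling step can be sketched as follows (the pixel values are placeholders):

```python
import numpy as np

# A stand-in for one grayscale CIFAR-10 image: 32x32 placeholder values.
image = np.arange(32 * 32).reshape(32, 32)

# The MLP and random forest need vectors, so each image is unraveled,
# row by row, into 1,024 numbers. The CNN keeps the 2D grid as is.
vector = image.reshape(-1)
```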
The random forest uses 300 trees. The traditional neural network has two hidden layers of 512 and 100 nodes, respectively. The CNN is more complex, with four convolutional layers, two pooling layers, and a single dense layer of 472 nodes. Even though the CNN has many more layers, the total number of weights and biases to learn is nearly identical to the traditional model: 577,014 versus 577,110.
We’ll train the neural networks for 100 epochs, meaning 100 passes through the full training set. Fixing the minibatch size at 200 gives us 250 gradient descent steps per epoch. Therefore, during training, we’ll update the weights and biases of the networks 25,000 times. At the end of each epoch, we’ll capture the error made by the model on both the training and test sets. When the dust settles, a single graph will reveal everything we want to know.
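The bookkeeping behind those numbers:

```python
# Bookkeeping behind the training schedule described in the text.
training_images = 50_000
minibatch_size = 200
epochs = 100

steps_per_epoch = training_images // minibatch_size  # gradient descent steps per epoch
total_weight_updates = steps_per_epoch * epochs      # updates over the full run
```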
Figure 5-9 is that graph. It’s the most complex graph we’ve seen, so let’s walk through it in detail, beginning with the axes.
Figure 5-9: CIFAR-10 results for a CNN, MLP, and random forest
The label on the horizontal axis (x-axis) is “epoch,” which means a complete pass through the training set. Therefore, the graph shows things changing during training after every epoch. We also know that each epoch represents 250 gradient descent steps. The vertical axis (y-axis) is labeled “error” and runs from 0.1 to 0.8. This axis represents the fraction of the test or training samples that the model gets wrong. The lower the error, the better. A decimal value of 0.1 means 10 percent, and a value of 0.8 means 80 percent.
The legend in the upper-right corner of the graph tells us that the circles and squares relate to the MLP, the traditional neural network, while the triangles and pentagons refer to the CNN. Specifically, the circles and triangles track the error on the test set for the MLP and CNN, respectively, as the models train. Similarly, the squares and pentagons track the error on the training set. Recall that the model’s performance on the training set is used to update the weights and biases. The test set is used for evaluation and does not contribute to how the model is trained.
The MLP plots show us how well the model learned the training set (squares) and the test set (circles) as training continued, epoch after epoch. It’s immediately apparent that the model learned the training set better than the test set because the training set error decreases continuously. This is what we expect. The gradient descent algorithm will update the weights and biases of the MLP, all 577,110 of them, to arrive at a lower and lower error on the training set. However, we’re not interested in reaching zero error on the training set; instead, we want the smallest error possible on the test set because that gives us a reason to believe that the MLP has learned to generalize.
Now consider the circle plot showing us the test set error. It reaches a minimum of about 0.56, or 56 percent, at around 40 epochs. After that, the error increases slowly but steadily, up to 100 epochs. This effect is classic MLP overfitting. The training set error continues to decrease, but the test set error hits a minimum and continues to increase after that. Figure 5-9 tells us that stopping training at 40 epochs would have given us the best-performing MLP.
We’ll get to the CNN results, but for the moment, consider the dashed line at 58 percent error. It’s labeled “RF300” and shows us the test set error from a random forest with 300 trees. The random forest doesn’t learn by updating weights over epochs, so the 58 percent error is just that: the model’s error. I plotted it as a dashed line parallel to the horizontal axis so you can see that, briefly, the MLP did slightly better than the random forest, but by 100 epochs, the difference between the two models was negligible. In other words, we might take it that classical machine learning’s best effort on the grayscale CIFAR-10 dataset is an error of about 56 to 58 percent. That’s not a good result. Additional time spent with the parameters of the random forest, or the MLP, or starting over with a support vector machine might lead to a slight reduction in the error. Still, it’s unlikely to overcome the fact that classical machine learning cannot do much with this dataset.
最后,考虑 CNN 的训练(五边形)和测试(三角形)曲线。到 100 个 epoch 时,CNN 在训练集上的错误率约为 11%,更重要的是,在测试集上的错误率约为 23%。换句话说,CNN 的正确率是 77%,即 10 次中有近 8 次。在 10 类数据集上,随机猜测的正确率约为 10%,因此 CNN 学习得相当好,远远优于 MLP 或随机森林。
Finally, consider the CNN’s training (pentagon) and test (triangle) curves. By 100 epochs, the CNN is right around 11 percent error on the training set and, more importantly, about 23 percent on the test set. In other words, the CNN is right 77 percent of the time, or nearly 8 times in 10. Random guessing will be correct about 10 percent of the time on a 10-class dataset, so the CNN has learned rather well, and far better than the MLP or random forest.
这正是卷积神经网络的要点:通过学习表示图像中对象的部分,可以学习网络密集层可以成功分类的新表示(正式称为嵌入)。
This is precisely the point of convolutional neural networks: by learning to represent the parts of the objects in an image, it becomes possible to learn a new representation (formally known as an embedding) that the dense layers of the network can successfully classify.
我在 2015 年训练的第一个 CNN 尝试检测卫星图像中的小型飞机。我最初的非 CNN 方法有效,但存在很多误报(假检测)的噪声。飞机在那里,但还有许多其他非飞机的东西。然后我训练了一个简单的 CNN,就像本实验中使用的那样。它轻松地定位了飞机,几乎除了飞机之外什么也没有。我惊呆了,然后意识到深度学习是一种范式转变。我将在第 7 章中指出,从 2022 年秋季开始,一种新的、更深刻的范式转变已经发生,但在我们准备好进行讨论之前,我们还有一些基础知识需要讨论。
The first CNN I trained, in 2015, attempted to detect small airplanes in satellite images. My initial, non-CNN approach worked, but it was noisy with many false positives (fake detections). The airplanes were there, but so were many other things that were not airplanes. I then trained a simple CNN like the one used in this experiment. It located the airplanes with ease, and virtually nothing but the airplanes. I was dumbfounded and realized then that deep learning was a paradigm shift. I’ll argue in Chapter 7 that as of fall 2022, a new, more profound paradigm shift has occurred, but we have some ground yet to cover before we’re ready for that discussion.
****
****
本章的简单 CNN 并没有充分展示可用的神经网络架构。十年的狂热发展已经产生了一些首选的 CNN 架构,其中一些具有超过 100 层。这些架构的名称包括 ResNet、DenseNet、Inception、MobileNet 和 U-Net 等。 U-Net值得多说几句。
The simple CNNs of this chapter don’t do justice to the zoo of available neural network architectures. A decade of fevered development has resulted in a few go-to CNN architectures, some with over 100 layers. The architectures have names like ResNet, DenseNet, Inception, MobileNet, and U-Net, among many others. The U-Net is worthy of a few words.
到目前为止,我们探索的 CNN 接受输入图像并返回“狗”或“猫”等类别标签。事情并不需要这样。一些 CNN 架构实现语义分割,其中输出是另一幅图像,每个像素都由其所属的类标记。 U-Nets就是这么做的。如果狗的每个像素都被标记为“狗”,那么从图像中提取狗就变得微不足道。 U-Net 和为整个图像分配单个标签的 CNN 之间的中间立场是输出边界框(包围检测到的对象的矩形)的模型。人工智能的普及意味着您可能已经见过带有标记边界框的图像。 YOLO(“你只看一次”)是一种流行的架构,可生成带标签的边界框; Faster R-CNN 是另一个。
The CNNs we’ve explored so far accept an input image and return a class label like “dog” or “cat.” It doesn’t need to be this way. Some CNN architectures implement semantic segmentation, where the output is another image with every pixel labeled by the class to which it belongs. U-Nets do this. If every pixel of the dog is marked “dog,” extracting the dog from the image becomes trivial. A middle ground between a U-Net and CNNs that assign a single label to the entire image is a model that outputs a bounding box, a rectangle surrounding the detected object. The pervasiveness of AI means that you’ve likely already seen images with labeled bounding boxes. YOLO (“you only look once”) is a popular architecture producing labeled bounding boxes; Faster R-CNN is another.
我们在这里关注图像输入,但输入不一定是图像。任何可以以类似图像的格式表示的东西,其中有二维和这些维度内的结构,都是 2D CNN 的候选者。一个很好的例子是音频信号,我们通常将其视为一维,即驱动扬声器的随时间变化的电压。然而,音频信号包含不同频率的能量。不同频率的能量可以二维显示:水平维度是时间,垂直维度是频率,通常较低频率在底部,较高频率在顶部。每个频率的强度变成像素的强度,将音频信号从一维、时变的电压转换成二维频谱图,如图5-10所示。
We focused on image inputs here, but the input need not be an image. Anything representable in an image-like format, where there are two dimensions and structure within those dimensions, is a candidate for a 2D CNN. A good example is an audio signal, which we usually think of as one-dimensional, a voltage changing over time that drives the speaker. However, audio signals contain energy at different frequencies. The energy at different frequencies can be displayed in two dimensions: the horizontal dimension is time, and the vertical dimension is frequency, usually with lower frequencies at the bottom and higher frequencies at the top. The intensity of each frequency becomes the intensity of a pixel to transform the audio signal from a one-dimensional, time-varying voltage into a two-dimensional spectrogram, as shown in Figure 5-10.
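As a rough illustration of how a one-dimensional signal becomes a two-dimensional spectrogram, here is a toy Python sketch using a naive DFT. A real system would use an optimized FFT and a proper window function; the frame and hop sizes here are arbitrary choices for the demo.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Slice a 1D signal into overlapping frames and take the magnitude
    spectrum of each frame: rows are frequency bins, columns are time."""
    bins = frame_len // 2 + 1
    spec = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        col = []
        for k in range(bins):  # naive DFT; fine for a small demo
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            col.append(abs(s))
        spec.append(col)
    # transpose so spec[k][t] is the energy of frequency bin k at time frame t
    return [list(row) for row in zip(*spec)]

# A pure tone at 1/8 of the sampling rate: its energy lands in bin 64/8 = 8.
fs, freq = 2000, 250.0
signal = [math.sin(2 * math.pi * freq * n / fs) for n in range(2000)]
spec = spectrogram(signal)
peak = max(range(len(spec)), key=lambda k: sum(spec[k]))
print(len(spec), len(spec[0]), peak)  # -> 33 61 8
```

A CNN then treats the resulting 2D array exactly as it would treat a grayscale image.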
图 5-10:一维数据到二维图像的映射
Figure 5-10: Mapping from one-dimensional data to a two-dimensional image
这张哭闹婴儿的声谱图包含丰富的信息和结构,CNN 可以了解这些信息和结构,从而生成比单独使用一维音频信号更好的模型。关键的观察结果是,以适合 CNN 的形式提取结构的输入数据的任何转换都是公平的游戏。
The spectrogram, here of a crying baby, contains a wealth of information and structure that the CNN can learn about to produce a better model than is possible with the one-dimensional audio signal alone. The key observation is that any transformation of the input data that extracts structure in a form amenable to a CNN is fair game.
****
****
您有一个数据集,需要构建一个 CNN。您应该使用什么架构?小批量大小应该是多少?您需要哪些层以及按什么顺序?应该使用 5×5 还是 3×3 卷积核?训练多少个epoch就足够了?早期,在标准架构开发之前,每个问题都必须由设计网络的人来回答。这有点像过去的医学:科学、经验和直觉的结合。神经网络的艺术意味着对从业者的需求量很大,而精明的软件工程师很难将深度学习添加到他们的技能中。有人想知道是否可以使用软件来自动确定模型的架构和训练参数(即第 3 章介绍的超参数)。因此,自动机器学习(AutoML)诞生了。
You have a dataset and need to build a CNN. What architecture should you use? What should the minibatch size be? What layers do you need, and in what order? Should you use 5×5 or 3×3 convolutional kernels? How many epochs of training is enough? Early on, before the development of standard architectures, each of those questions had to be answered by the person designing the network. It was a bit like medicine of the past: a mix of science, experience, and intuition. The art of neural networks meant that practitioners were in high demand, and it was difficult for savvy software engineers to add deep learning to their repertoires. Some people wondered if software could be used to determine the model’s architecture and training parameters automatically (that is, its hyperparameters, introduced in Chapter 3). And so, automatic machine learning, or AutoML, was born.
大多数基于云的商业机器学习平台,例如微软的 Azure 机器学习或亚马逊的 SageMaker Autopilot,都包含一个 AutoML 工具,可以为您创建机器学习模型;您只需要提供数据集。 AutoML 不仅仅适用于神经网络,许多工具还包括经典的机器学习模型。 AutoML 的全部目的是用最少的用户专业知识为所提供的数据集找到最佳模型类型。
Most cloud-based commercial machine learning platforms, like Microsoft’s Azure Machine Learning or Amazon’s SageMaker Autopilot, include an AutoML tool that will create the machine learning model for you; you need only supply the dataset. AutoML applies to more than just neural networks, and many tools include classical machine learning models as well. AutoML’s entire purpose is to locate the best model type for the supplied dataset with a minimum of user expertise required.
我想说 AutoML 只能走到这一步,最好的深度学习从业者总能胜过它,但这种论调听起来很空洞。它让我想起过去的汇编语言程序员,他们断言编译器不可能生成和他们手写的一样好甚至更好的代码。如今,汇编语言程序员的职位空缺寥寥无几,而使用编译型语言的程序员却有数以万计的职位(至少目前如此;参见第 8 章)。话虽如此,我们当中仍有一些人更喜欢自己动手构建模型。
I want to argue that AutoML only goes so far and that the best deep learning practitioners will always outperform it, but that argument rings hollow. It reminds me of the assembly language programmers of old pontificating on the impossibility of compilers ever producing code that was as good as or better than what they could produce. There are few job openings these days for assembly language programmers, but tens of thousands for programmers using compiled languages (at least for now; see Chapter 8). That said, some of us still prefer to roll our own models.
****
****
深度学习革命的一个结果是催生了强大的开源机器学习工具包,例如 TensorFlow 和 PyTorch。实现一个传统的全连接神经网络是机器学习学生的一项练习:并非轻而易举,但大多数人只要付出努力就能完成。另一方面,正确实现一个 CNN,尤其是支持多种层类型的 CNN,绝非易事。人工智能社区很早就致力于开发支持深度学习(包括 CNN)的开源工具包;没有这些工具包,人工智能的进展将极其缓慢。谷歌、Facebook(Meta)和 NVIDIA 等大型科技公司也纷纷加入,它们对工具包开发的持续支持对人工智能至关重要。
A consequence of the deep learning revolution was the creation of powerful, open source machine learning toolkits with names like TensorFlow and PyTorch. Implementing a traditional, fully connected neural network is an exercise for machine learning students. It’s not trivial, but it’s something most people can accomplish with effort. Properly implementing a CNN, on the other hand, especially one supporting a multitude of layer types, is anything but trivial. The AI community committed early on to developing open source toolkits supporting deep learning, including CNNs. Without these toolkits, progress in AI would be painfully slow. Large tech companies like Google, Facebook (Meta), and NVIDIA also signed on, and their continued support for toolkit development is critical to AI.
除了包含大量经过测试的高性能代码之外,工具包的强大之处还在于它们的灵活性。我们现在知道,训练神经网络(无论是 CNN 还是其他网络)需要两个步骤:反向传播和梯度下降。仅当模型的层支持一种称为微分的特定数学运算时,反向传播才有效。微分是微积分学生第一学期学习的内容。只要工具包能够自动求出导数(微分得到的结果),就可以允许用户实现任意的层。这些工具包通过将神经网络转换为计算图来实现自动微分。
What makes the toolkits powerful, besides the mountains of tested, high-performance code they contain, is their flexibility. We now appreciate that training a neural network, CNN or otherwise, requires two steps: backpropagation and gradient descent. Backpropagation works only if the model’s layers support a particular mathematical operation known as differentiation. Differentiation is what first semester calculus students learn. So long as the toolkits can automatically determine the derivatives (what you get when you differentiate), they allow users to implement arbitrary layers. The toolkits employ automatic differentiation by transforming the neural network into a computational graph.
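To give a taste of what "transforming the network into a computational graph" means, here is a deliberately tiny reverse-mode automatic differentiation sketch. It supports only addition and multiplication and is nothing like a production toolkit, but the principle is the same: record the graph as operations are applied, then propagate gradients backward through it.

```python
class Value:
    """A tiny reverse-mode autodiff node: it records the computational graph
    as operations are applied, then backpropagates gradients through it."""
    def __init__(self, data, parents=(), grads=()):
        self.data = data
        self.parents = parents   # upstream nodes in the graph
        self.grads = grads       # local derivative w.r.t. each parent
        self.grad = 0.0
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))
    def backward(self, upstream=1.0):
        # Accumulate the gradient flowing in, then pass it upstream,
        # scaled by each local derivative (the chain rule).
        self.grad += upstream
        for parent, local in zip(self.parents, self.grads):
            parent.backward(upstream * local)

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x
x, y = Value(3.0), Value(4.0)
f = x * y + x
f.backward()
print(f.data, x.grad, y.grad)  # -> 15.0 5.0 3.0
```

PyTorch and TensorFlow do the same thing at vastly larger scale, with tensors in place of single numbers and hundreds of differentiable operations in place of two.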
人们很容易想沿着自动微分和计算图的道路多走几步,因为其中的优雅与灵活是数学和计算机科学的完美结合。遗憾的是,您需要相信我的话,因为所需的细节程度远远超出了本书所能探讨的范围。一个关键点是,自动微分有两种主要方法:前向和反向。前向自动微分更容易概念化,也更容易用代码实现,但不适合神经网络。从某种意义上说这有点可惜,因为前向自动微分最好用对偶数来实现,这是英国数学家 William Clifford 于 1873 年发明(或者说发现?)的一种冷门数字类型。对偶数曾是“为数学而数学”的典型例子,在很大程度上被人遗忘,直到计算机时代才突然变得有用。反向自动微分最适合神经网络,但不使用对偶数。
It’s tempting to take a few steps down the path of automatic differentiation and computational graphs because the elegance and flexibility therein is a beautiful marriage of mathematics and computer science. Unfortunately, you’ll need to take my word for it because the level of detail necessary is far beyond what we can explore in this book. One key point is that there are two primary approaches to automatic differentiation: forward and reverse. Forward automatic differentiation is easier to conceptualize and implement in code but is unsuited to neural networks. That’s too bad, in a way, because forward automatic differentiation is best implemented using dual numbers, an obscure type of number invented (discovered?) by English mathematician William Clifford in 1873. These were a prime example of math for math’s sake and largely forgotten until the age of computers, when they were suddenly made useful. Reverse automatic differentiation is best for neural networks but doesn’t use dual numbers.
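Dual numbers make forward-mode automatic differentiation almost trivial to sketch. Extend arithmetic with numbers of the form a + b·ε where ε² = 0, and the ε part of any result carries the derivative along for free (only addition and multiplication are shown here):

```python
class Dual:
    """Clifford's dual numbers a + b*eps with eps**2 = 0. Feeding Dual(x, 1)
    through a function computes f(x) and f'(x) simultaneously: this is
    forward-mode automatic differentiation."""
    def __init__(self, real, eps):
        self.real, self.eps = real, eps
    def __add__(self, other):
        return Dual(self.real + other.real, self.eps + other.eps)
    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

def f(x):        # f(x) = x*x + x, so f'(x) = 2x + 1
    return x * x + x

y = f(Dual(3.0, 1.0))
print(y.real, y.eps)  # -> 12.0 7.0
```

The real part is f(3) = 12 and the ε part is f′(3) = 7, with no symbolic algebra and no numerical approximation.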
****
****
这一章很有挑战性。我们比前几章或后续章节更深入地探讨了细节。总结是肯定需要的。卷积神经网络:
This chapter was challenging. We dove more deeply into the details than we did in previous chapters or will in the following ones. A summary is definitely required. Convolutional neural networks:
卷积神经网络遵循经典机器学习模型的传统:它们接受输入并以某种方式为其分配类别标签。该网络作为数学函数运行,接受输入并产生输出。下一章向我们介绍无需输入即可生成输出的神经网络。
Convolutional neural networks follow in the tradition of classical machine learning models: they take an input and assign to it, in some fashion, a class label. The network operates as a mathematical function, accepting an input and producing an output. The next chapter introduces us to neural networks that generate output without input.
套用一部老电视节目的话:你正在穿越另一个维度,不仅是视觉和声音的维度,而且是心灵的维度,这是一次进入奇妙土地的旅程,其边界是想象力——下一站是生成人工智能。
To paraphrase an old television show: you’re traveling through another dimension, a dimension not only of sight and sound but of mind, a journey into a wondrous land whose boundaries are that of imagination—next stop, generative AI.
关键术语
KEY TERMS
自动微分、AutoML、边界框、计算图、卷积、卷积层、卷积神经网络、密集层、有效感受野、嵌入、端到端学习、滤波器、内核、池化层、语义分割
automatic differentiation, AutoML, bounding box, computational graph, convolution, convolutional layer, convolutional neural network, dense layer, effective receptive field, embedding, end-to-end learning, filter, kernel, pooling layer, semantic segmentation
生成式人工智能是独立(随机)或基于用户提供的提示创建新颖输出的模型的总称。生成模型不产生标签,而是产生文本、图像甚至视频。从本质上讲,生成模型是由相同基本组件构建的神经网络。
Generative AI is an umbrella term for models that create novel output, either independently (randomly) or based on a prompt supplied by the user. Generative models do not produce labels but text, images, or even video. Under the hood, generative models are neural networks built from the same essential components.
我们将重点关注三种生成式人工智能模型:生成式对抗网络、扩散模型和大语言模型。本章涵盖前两部分。大型语言模型最近彻底改变了人工智能世界。它们是第 7 章的主题。
We’ll focus on three kinds of generative AI models: generative adversarial networks, diffusion models, and large language models. This chapter covers the first two. Large language models have recently turned the world of AI on its head. They are the subject of Chapter 7.
****
****
生成对抗网络(GAN)由两个一起训练的独立神经网络组成。第一个网络是生成器。它的任务是学习如何为鉴别器创建虚假输入。鉴别器的任务是学习如何区分虚假输入和真实输入。一起训练两个网络的目标是,生成器能够更好地伪造鉴别器,而鉴别器则尽力区分真假。
Generative adversarial networks (GANs) consist of two separate neural networks trained together. The first network is the generator. Its task is to learn how to create fake inputs for the discriminator. The discriminator’s task is to learn how to differentiate between fake and real inputs. The goal of training the two networks together is that the generator becomes better at faking out the discriminator while the discriminator tries its best to differentiate real from fake.
起初,生成器的表现很糟糕:它输出的只是噪声,鉴别器毫不费力就能区分真假。然而,生成器会随着时间推移不断改进,使鉴别器的工作越来越困难;这反过来又推动鉴别器成为更好的真假检测器。当训练宣告完成时,鉴别器通常会被丢弃,而训练好的生成器则用于产生新的输出,这些输出是从训练数据的学习空间中随机采样得到的。
At first, the generator is terrible. It outputs noise, and the discriminator has no difficulty distinguishing between real and fake. However, the generator improves over time, making the discriminator’s job increasingly harder; this in turn pushes the discriminator to become a better real versus fake detector. When training is declared complete, the discriminator is usually discarded, and the now-trained generator is used to produce new output sampled randomly from the learned space of the training data.
我没有具体说明训练数据是什么,因为我们现在需要知道的是 GAN 是由两个竞争(对抗)网络构建的。对于大多数应用程序来说,归根结底,它就是我们想要的生成器。
I haven’t specified what the training data is, because all we need to know for now is that a GAN is constructed from two competing (adversarial) networks. For most applications, it’s the generator we want when all is said and done.
从结构上来说,我们可以想象一个像图 6-1中的块一样的 GAN 。 (我将及时解释随机向量部分。)从概念上讲,我们看到鉴别器接受两种输入:真实数据和生成器的输出。鉴别器的输出是一个标签:“真实”或“假”。使用反向传播和梯度下降的标准神经网络训练一起训练生成器和鉴别器,但不是同时训练。
Structurally, we can imagine a GAN like the blocks in Figure 6-1. (I’ll explain the random vector part in time.) Conceptually, we see that the discriminator accepts two kinds of inputs: real data and the output of the generator. The discriminator’s output is a label: “Real” or “Fake.” Standard neural network training using backpropagation and gradient descent trains the generator and discriminator together, but not simultaneously.
图 6-1:生成对抗网络架构的概念化
Figure 6-1: Conceptualizing the architecture of a generative adversarial network
例如,使用小批量真实数据(可用真实训练数据的一小部分)进行训练遵循以下步骤:
For example, training with a minibatch of real data—a small subset of the available real training data—follows these steps:
该算法首先使用当前生成器更新鉴别器的权重(步骤 5),然后冻结它们(步骤 6),以便可以在不改变鉴别器的情况下更新生成器的权重。这种方法是必要的,因为我们希望鉴别器的输出(“真实”或“假”标签)来更新生成器部分。请注意,生成器更新将所有假图像标记为真实图像。这样做可以根据假输入对鉴别器的真实程度来对生成器进行评分。
The algorithm first updates the discriminator’s weights using the generator as it currently is (step 5), then freezes them (step 6) so the generator’s weights can be updated without altering the discriminator. This approach is necessary because we want the output of the discriminator—the “Real” or “Fake” labels—to update the generator portion. Notice that the generator update marks all the fake images as real. Doing this scores the generator by how real the fake inputs appear to the discriminator.
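The alternating update described above can be sketched as a single training step. Everything here is a stand-in: `train` represents one backpropagation/gradient-descent update, and freezing the discriminator is modeled simply by not calling `train` on it during the generator's turn.

```python
import random

def gan_training_step(reals, batch, G, D, train):
    """One alternating GAN update, with stand-in pieces. `train(net, inputs,
    labels)` stands for a single backprop/gradient-descent step."""
    # 1. Sample a minibatch of real data and a minibatch of noise vectors.
    real_batch = random.sample(reals, batch)
    noise = [[random.gauss(0, 1) for _ in range(10)] for _ in range(batch)]
    # 2. Use the generator, as it currently is, to turn noise into fakes.
    fakes = [G(z) for z in noise]
    # 3. Update the discriminator: reals labeled 1 ("Real"), fakes labeled 0.
    train(D, real_batch + fakes, [1] * batch + [0] * batch)
    # 4. Freeze D; update the generator on fresh noise, with the resulting
    #    fakes labeled "Real": G is scored by how real its fakes look to D.
    noise = [[random.gauss(0, 1) for _ in range(10)] for _ in range(batch)]
    train(G, noise, [1] * batch)

# Smoke test with stand-ins: G maps noise to an "image" and train() just
# logs which network was updated and with what set of labels.
log = []
G = lambda z: [abs(v) for v in z]
train = lambda net, x, y: log.append((net is G, sorted(set(y))))
gan_training_step(reals=[[0.0] * 10 for _ in range(64)], batch=8,
                  G=G, D=object(), train=train)
print(log)  # -> [(False, [0, 1]), (True, [1])]
```

The log confirms the alternation: first the discriminator sees a mixed real/fake batch, then only the generator is updated, with its fakes labeled as real.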
让我们检查用作生成器输入的随机向量。 GAN 的重点是学习训练集的表示,我们可以将其视为数据生成器,就像生成真实训练集的数据生成过程一样。然而,在这种情况下,数据生成器可以被视为一个函数,它接受数字的随机集合(随机向量),并将它们转换为可能来自训练集的输出。换句话说,生成器的作用就像一个数据增强设备。生成器的随机输入成为训练集的示例。实际上,生成器是最初创建真实训练集的实际数据生成过程的代理。
Let’s examine the random vector used as input to the generator. The point of a GAN is to learn a representation of the training set that we can think of as a data generator, like the data-generating process that produced the real training set. However, in this case, the data generator can be viewed as a function that takes a random collection of numbers, the random vector, and transforms them into an output that might plausibly have come from the training set. In other words, the generator acts like a data augmentation device. The random input to the generator becomes an example of the training set. In effect, the generator is a proxy for the actual data-generating process that created the real training set in the first place.
数字的随机向量是从概率分布中得出的。从概率分布中采样类似于掷两个骰子并询问它们的总和为 7 与 2 的可能性有多大。总和更有可能是 7,因为有更多方法可以将这两个数字相加并得到 7。只有一种方法可以得到两个:蛇眼。从正态分布中采样是类似的。最常见的返回样本是分布的平均值。平均值两侧的值离平均值越远,出现的可能性就越小,但仍有可能。
The random vector of numbers is drawn from a probability distribution. Sampling from a probability distribution is akin to rolling two dice and asking how likely it is that their sum is a seven versus a two. It’s more likely that the sum is a seven because there are more ways to add the two numbers and get seven. There’s only one way to get two: snake eyes. Sampling from a normal distribution is similar. The most common sample returned is the average value of the distribution. Values on either side of the average are less likely the further away from the average they are, though still possible.
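The two-dice intuition is easy to verify in Python: count the ways each sum can occur, then simulate rolls and confirm that sevens vastly outnumber snake eyes.

```python
import random
from collections import Counter

# Count the ways two dice can sum to each total: seven has the most (6 of
# 36 combinations), two (snake eyes) the fewest (1 of 36).
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
print(ways[7], ways[2])  # -> 6 1

# Sampling mirrors those odds: simulate 10,000 rolls of two dice.
rng = random.Random(42)
rolls = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(10000))
print(rolls[7] > rolls[2])  # -> True: sevens appear far more often than twos
```

Sampling from a normal distribution works the same way: values near the mean have many "ways" to occur and so appear most often.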
例如,图 6-2 显示了人类身高分布(以英寸为单位)的条形图。原始数据集包含 25,000 人的身高,然后将其归入图中的 30 个区间。条形越高,落入该区间的人就越多。
For example, Figure 6-2 shows a bar plot of the distribution of human heights in inches. The original dataset contained the heights of 25,000 people, which were then fit into the 30 bins of the figure. The higher the bar, the more people fell into that bin.
图6-2:人体身高分布
Figure 6-2: The distribution of human height
请注意直方图的形状,它看起来像一个钟形,因此它有一个有点老式的名称:钟形曲线。它的现代名称“正态分布”是因为它在自然界中出现得如此频繁,以至于它是通常遇到的分布,尤其是对于物理过程生成的数据。从分布中,我们看到随机选择的人的身高通常在 68 英寸左右:超过 10% 的抽样人群属于该范围。
Note the shape of the histogram, which looks like a bell—hence its somewhat old-fashioned name, the bell curve. Its modern name, the normal distribution, is due to it showing up so often in nature that it’s the distribution normally encountered, especially for data generated by a physical process. From the distribution, we see that the height of a randomly selected person will most often be around 68 inches: more than 10 percent of the sampled population fell into that bin.
GAN 使用的随机向量(也称为噪声向量)的工作原理与此相同。在这种情况下,平均值为零,大多数样本落在 –3 到 3 的范围内。此外,向量中 n 个元素中的每一个都服从这一范围,这意味着向量本身是来自 n 维空间的一个样本,而不是图 6-2 那样的一维空间。
The random vector used by a GAN, also known as the noise vector, works the same way. The average, in this case, is zero, with most samples in the range –3 to 3. Also, each of the n elements in the vector follows this range, meaning the vector itself is a sample from an n-dimensional space, not the one-dimensional space of Figure 6-2.
对标记数据集的需求是机器学习的祸根。 GAN 没有这样的限制。我们不关心训练样本的类别是什么,只关心它是真实数据的实例,而不管类别标签是什么。当然,我们仍然要求训练集反映我们想要生成的数据类型,但训练集不需要被标记。
The need for labeled datasets is a bane of machine learning. GANs have no such restriction. We don’t care what a training sample’s class is, only that it’s an instance of real data, regardless of the class label. Of course, we still require that the training set reflect the kind of data we want to generate, but the training set need not be labeled.
****
****
让我们使用我们的老朋友 MNIST 数字数据集构建一个生成对抗网络。生成器将学习将一组随机的 10 个数字(即n为 10)转换为数字图像。经过训练后,我们可以为生成器提供 0 附近 10 个值的任意集合,生成器将生成一个新的数字图像作为输出,从而模仿创建 MNIST 数据集的过程:人们在纸上手写数字。经过训练的 GAN 生成器会产生无限量的目标输出。
Let’s build a generative adversarial network using our old friend, the MNIST digits dataset. The generator will learn to transform a random set of 10 numbers (meaning n is 10) into a digit image. Once trained, we can give the generator any collection of 10 values around zero, and the generator will produce a new digit image as output, thereby mimicking the process that created the MNIST dataset: people writing digits on paper by hand. A trained GAN generator produces an infinite supply of the target output.
我们将使用基于传统神经网络的简单 GAN 来创建一个生成器,以无限供应 MNIST 风格的数字图像。首先,我们将像第 5 章那样把现有的 MNIST 训练集展开,使每个样本都是一个 784 维向量。这给了我们真实数据。为了制造假数据,我们需要 10 元素的随机向量,我们将通过从均值为零的正态分布中抽取 10 个样本来构建这些向量。
We’ll use a simple GAN based on traditional neural networks to create a generator for an infinite supply of MNIST-style digit images. First, we’ll unravel the existing MNIST training set so each sample is a 784-dimensional vector, just as we did in Chapter 5. This gives us the real data. To create fake data, we need 10-element random vectors that we’ll build by drawing 10 samples from a normal distribution with an average value of zero.
该模型的生成器部分接受 10 元素噪声向量作为输入,并生成代表合成数字图像的 784 元素输出向量。回想一下,784 个数字可以重新排列成 28×28 像素的图像。生成器模型具有三个隐藏层,分别具有 256、512 和 1,024 个节点,以及一个具有 784 个节点的输出层来生成图像。隐藏层节点使用修正线性单元的修改版本,称为leaky ReLU。如果输入为正,Leaky ReLU 激活会输出输入,但如果输入为负,则输出是一个小的正值乘以负输入。换句话说,它们有点泄漏。输出层使用双曲正切激活函数,这意味着 784 个输出元素中的每一个都将在 –1 到 +1 的范围内。这是可以接受的。将图像写入磁盘时,我们可以将值调整为 0 到 255。
The generator portion of the model accepts a 10-element noise vector as input and produces a 784-element output vector representing the synthesized digit image. Recall that the 784 numbers can be rearranged into a 28×28-pixel image. The generator model has three hidden layers, with 256, 512, and 1,024 nodes, and an output layer of 784 nodes to produce the image. The hidden layer nodes use a modified version of the rectified linear unit called a leaky ReLU. Leaky ReLU activations output the input if the input is positive, but if the input is negative, the output is a small positive value multiplied by the negative input. In other words, they leak a bit. The output layer uses a hyperbolic tangent activation function, meaning every one of the 784 output elements will be in the range –1 to +1. That’s acceptable. We can scale the values to 0 to 255 when writing an image to disk.
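The two activation functions mentioned here are one-liners. Below is a sketch of leaky ReLU (the leak slope is an illustrative choice; implementations vary) and of rescaling a tanh output in [–1, 1] to a pixel value in [0, 255]:

```python
import math

def leaky_relu(x, slope=0.01):
    """Pass positive inputs through unchanged; 'leak' a small multiple of
    negative inputs instead of clamping them to zero."""
    return x if x > 0 else slope * x

def to_pixel(t):
    """Map a tanh output in [-1, 1] to an integer pixel value in [0, 255]."""
    return round((t + 1.0) / 2.0 * 255)

print(leaky_relu(2.0))                 # -> 2.0
print(leaky_relu(-2.0))                # -> -0.02
print(to_pixel(-1.0), to_pixel(1.0))   # -> 0 255
print(to_pixel(math.tanh(10)))         # -> 255
```

The small leak keeps gradients flowing through negative inputs during training, which is why leaky ReLU is a common choice in GAN generators and discriminators.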
生成器必须在随机噪声矢量输入和输出图像之间进行映射。鉴别器必须将图像作为输入,这意味着 784 维向量。鉴别器具有三个隐藏层,与生成器类似,但相反:1,024 个节点,然后是 512 个节点,最后是 256 个节点。判别器的输出层有一个带有 sigmoid 激活函数的节点。 sigmoid 产生从 0 到 1 的值,我们可以将其解释为判别器相信输入是真实的(输出接近 1)或假的(输出接近 0)。请注意,网络仅使用标准的全连接层。高级 GAN 使用卷积层,但探索这些网络的细节超出了我们的范围。
The generator must map between the random noise vector input and an output image. The discriminator must take an image as input, implying a 784-dimensional vector. The discriminator has three hidden layers, like the generator, but in reverse: 1,024 nodes, then 512 nodes, followed by 256 nodes. The discriminator’s output layer has one node with a sigmoid activation function. The sigmoid produces values from 0 to 1, which we can interpret as the discriminator’s belief that the input is real (output near 1) or fake (output near 0). Notice that the network uses nothing more than standard fully connected layers. Advanced GANs use convolutional layers, but exploring the details of those networks is outside our scope.
图 6-3显示了生成器(顶部)和鉴别器(底部)。两者之间的对称性在隐藏层中的节点数量上很明显,但请注意判别器中的顺序是相反的。
Figure 6-3 shows the generator (top) and discriminator (bottom). The symmetry between the two is evident in the numbers of nodes in the hidden layers, though notice that the order is reversed in the discriminator.
图 6-3:GAN 生成器(顶部)和鉴别器(底部)
Figure 6-3: GAN generator (top) and discriminator (bottom)
生成器接受 10 元素随机向量作为输入,并生成代表假图像的 784 元素输出向量。判别器接受一个图像向量(真实或虚假),并输出一个预测值,即一个 0 到 1 之间的数字:假图像应产生接近 0 的值,真实图像应产生接近 1 的值。如果生成器训练得足够好,判别器在大多数时候都会被愚弄,这意味着对所有输入,判别器的输出都将接近 0.5。
The generator accepts a 10-element random vector as input and produces a 784-element fake image output vector. The discriminator accepts an image vector, real or fake, and outputs a prediction, a number from 0 to 1. Fake images should produce values close to 0 and real images values close to 1. If the generator is well trained, the discriminator will be fooled most of the time, meaning the discriminator’s output will be close to 0.5 for all inputs.
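The sigmoid that produces the discriminator's 0-to-1 "belief" is simple to write down:

```python
import math

def sigmoid(x):
    """Squash a discriminator's raw output into (0, 1), interpreted as its
    belief that the input is real (near 1) rather than fake (near 0)."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(0.0), 2))  # -> 0.5: maximally unsure, real or fake
print(sigmoid(6.0) > 0.99)     # -> True: confidently "Real"
print(sigmoid(-6.0) < 0.01)    # -> True: confidently "Fake"
```

A well-trained generator pushes the discriminator's raw output toward zero for all inputs, which is exactly the sigmoid's 0.5 "can't tell" point.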
整个网络经过 200 个 epoch(每个 468 个小批量)的训练,总共 93,600 个梯度下降步骤。我们可以在每个时期之后显示生成器的样本,以观察网络的学习情况。图 6-4从左到右显示了 epoch 1、60 和 200 之后的样本。
The entire network is trained for 200 epochs of 468 minibatches each, for a total of 93,600 gradient descent steps. We can display samples from the generator after each epoch to observe the network as it learns. Figure 6-4 shows samples after epochs 1, 60, and 200, from left to right.
图 6-4:第 1、60 和 200 轮之后的生成器输出
Figure 6-4: Generator output after epochs 1, 60, and 200
正如我们所期望的,生成器在一次通过训练数据后表现不佳,但可能没有我们想象的那么差。大多数生成的图像看起来都很像;其他数字形状,如零和二,也存在,尽管有噪音。
As we’d expect, the generator performs poorly after a single pass through the training data, but perhaps not as poorly as we might have thought. Most of the generated images look like ones; other digit shapes, like zeros and twos, are also present, though noisy.
60 个 epoch 后,生成器会生成全范围的数字。有些是正确的,而另一些则仍然困惑或只是部分绘制。经过 200 个纪元后,大多数数字都清晰且定义清晰。生成器经过训练,现在可以根据需要生成数字图像。
After 60 epochs, the generator produces a full range of digits. Some are spot on, while others are still confused or only partially drawn. After 200 epochs, most of the digits are distinct and sharply defined. The generator is trained and now available to produce digit images on demand.
****
****
我们的数字生成器很乐意为我们创建 10,000 个新的数字图像,但如果我们希望所有这些数字都是四呢?随机输入向量会产生一个随机数字,但我们无法选择哪个数字。如果我们随机选择输入向量,我们就可以合理地相信输出数字的混合也将是类似随机的。我通过使用经过训练的生成器创建 1,000 个数字图像来测试该假设。然后,我将这些数字图像传递给在 MNIST 数据集上训练的卷积网络。假设输入是数字图像,卷积网络的测试集准确率高于 99%,这让我们对其预测充满信心。 GAN 生成器生成逼真的数字图像,因此我们有坚实的基础。
Our digit generator will happily create 10,000 new digit images for us, but what if we want all those digits to be fours? A random input vector produces a random digit, but we don’t get to choose which one. If we select input vectors randomly, we can be excused for believing that the mix of output digits will be similarly random. I tested that assumption by using the trained generator to create 1,000 digit images. I then passed those digit images to a convolutional network trained on the MNIST dataset. The convolutional network has a test set accuracy above 99 percent, giving us confidence in its predictions, assuming the input is a digit image. The GAN generator produces realistic digit images, so we’re on solid ground.
假设生成器按我们预期的方式运行,那么朴素地想,每个数字出现的百分比应该相同。共有 10 个可能的数字,因此我们预计每个数字出现的频率约为 10%。但事实并非如此。表 6-1 显示了每个数字实际出现的分布情况。
Assuming the generator is acting as we expect, the percentage of each digit should, naively, be the same. There are 10 possible digits, so we expect each to appear about 10 percent of the time. That’s not what happened. Table 6-1 shows the actual distribution of occurrences of each digit.
表 6-1:实际数字分布
Table 6-1: The Actual Digit Distribution
| 数字 Digit | 百分比 Percentage |
|---|---|
| 0 | 10.3 |
| 1 | 21.4 |
| 2 | 4.4 |
| 3 | 7.6 |
| 4 | 9.5 |
| 5 | 6.0 |
| 6 | 9.1 |
| 7 | 14.4 |
| 8 | 4.4 |
| 9 | 12.9 |
生成器偏爱 1,其次是 7、9 和 0;8 和 2 是最不可能的输出。因此,GAN 不仅不允许我们选择想要的数字类型,它还有明显的偏好。回顾图 6-4 中最左边的图像,即第 1 个 epoch 后的样本:其中大部分数字都是 1,可见 GAN 对 1 的偏爱从训练一开始就很明显。GAN 确实学到了东西,但 1 占比过高是一个有时困扰 GAN 训练的问题的症状:模式崩溃(mode collapse)。此时,生成器在训练早期就学会了制造一个(或一组)特别能骗过判别器的样本,并陷入只产生这种输出的困境,而无法产生我们想要的多样化图像。
The generator favors ones, followed by sevens, nines, and zeros; eights and twos are the least likely outputs. So, not only does the GAN not allow us to select the desired digit type, it has definite favorites. Review the leftmost image in Figure 6-4, showing the epoch 1 samples. Most of those digits are ones, so the GAN’s predilection for ones was evident from the beginning of training. The GAN learned, but the preponderance of ones is a symptom of a problem that sometimes plagues GAN training: namely mode collapse, where the generator learns early on how to create a particularly good example or set of examples that fool the discriminator and gets trapped into producing only that output and not the desired diversity of images.
我们不需要让自己受到挑剔、无法控制的 GAN 的摆布。相反,我们可以在训练期间通过传入我们希望生成器创建的数字类型的指示来调节网络。采用这种方法的 GAN 称为条件 GAN。与无条件 GAN 不同,它们需要带有标签的训练集。
We need not throw ourselves on the mercy of a finicky, uncontrollable GAN. Instead, we can condition the network during training by passing in an indication of the type of digit we want the generator to create. GANs that take this approach are known as conditional GANs. Unlike unconditional GANs, they require training sets with labels.
在条件 GAN 中,生成器的输入仍然是随机噪声向量,但附加到它的是另一个指定所需输出类的向量。例如,MNIST 数据集有 10 个类别,数字为 0 到 9,因此条件向量有 10 个元素。如果所需的类别是数字 3,则条件向量除元素 3 之外全部为零,元素 3 设置为 1。这种表示类信息的方法称为one-hot 编码,因为除了与所需类标签对应的元素为 1 之外,向量的所有元素均为零。
In a conditional GAN, the input to the generator is still a random noise vector, but attached to it is another vector specifying the desired output class. For example, the MNIST dataset has 10 classes, the digits 0 through 9, so the conditional vector has 10 elements. If the desired class is the digit 3, the conditional vector is all zeros except for element 3, which is set to one. This method of representing class information is known as one-hot encoding because all the elements of the vector are zero except for the element corresponding to the desired class label, which is one.
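One-hot encoding, and attaching the class vector to the noise vector, looks like this in Python (the vector sizes match this chapter's MNIST example, 10 noise elements plus 10 classes):

```python
def one_hot(label, num_classes=10):
    """Encode a class label as a vector of zeros with a single one at the
    position corresponding to the label."""
    v = [0.0] * num_classes
    v[label] = 1.0
    return v

# For a conditional GAN, the generator's input is the noise vector with the
# one-hot class vector attached: here 10 + 10 = 20 elements in total.
noise = [0.5] * 10          # stand-in for a sampled noise vector
cond_input = noise + one_hot(3)
print(one_hot(3))           # -> [0.0, 0.0, 0.0, 1.0, 0.0, ...]
print(len(cond_input))      # -> 20
```

The generator then learns to associate that trailing block of the input with the digit class it must draw.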
鉴别器还需要类标签。如果鉴别器的输入是图像,我们如何包含类标签?一种方法是将 one-hot 编码的概念扩展到图像。我们知道,一幅彩色图像由三个图像矩阵表示,一个用于红色通道,一个用于绿色通道,一个用于蓝色通道。灰度图像只有一个通道。我们可以将类标签包含为一组附加输入通道,其中除了与类标签对应的通道为 1 之外,所有通道均为零。
The discriminator also needs the class label. If the input to the discriminator is an image, how do we include the class label? One way is to expand the concept of one-hot encoding to images. We know that a color image is represented by three image matrices, one for the red channel, one for the green channel, and one for the blue channel. Grayscale images have only one channel. We can include the class label as a set of additional input channels where all the channels are zero except for the channel corresponding to the class label, which is one.
在生成和区分真假输入时包含类标签会迫使整个网络的每个部分学习如何生成和解释特定于类的输出和输入。如果类标签是 4 并且生成器生成的数字看起来更像零,则鉴别器将知道存在类不匹配,因为它知道来自标记训练集的真实零。
Including the class label when generating and discriminating between real and fake inputs forces each part of the entire network to learn how to produce and interpret class-specific output and input. If the class label is 4 and the digit produced by the generator looks more like a zero, the discriminator will know there’s a class mismatch because it knows about true zeros from the labeled training set.
当使用经过训练的生成器时,条件 GAN 的优势就显现出来了。用户提供所需的类别作为 one-hot 向量,以及无条件 GAN 使用的随机噪声向量。然后,生成器根据噪声向量输出样本,但以所需的类标签为条件。我们可以将条件 GAN 视为一组无条件 GAN,每个 GAN 都针对单一类别的图像进行训练。
The benefit of a conditional GAN comes when using the trained generator. The user supplies the desired class as a one-hot vector, along with the random noise vector used by an unconditional GAN. The generator then outputs a sample based on the noise vector, but conditioned on the desired class label. We can think of a conditional GAN as a set of unconditional GANs, each trained on a single class of images.
我在 MNIST 数据集上训练了一个条件 GAN。在这个例子中,GAN 使用了卷积层,而不是本章前面使用的全连接层。然后,我要求经过充分训练的生成器生成每个数字 10 个样本,如图6-5所示。
I trained a conditional GAN on the MNIST dataset. For this example, the GAN used convolutional layers instead of the fully connected layers used earlier in the chapter. I then asked the fully trained generator to produce 10 samples of each digit, as shown in Figure 6-5.
图 6-5:条件 GAN 输出显示每个数字的样本
Figure 6-5: The conditional GAN output showing samples for each digit
条件 GAN 让我们选择所需的输出类别,这是无条件 GAN 无法做到的,但如果我们想调整输出图像的特定特征怎么办?为此,我们需要一个可控的 GAN。
Conditional GANs let us select the desired output class, which unconditional GANs cannot do, but what if we want to adjust specific features of the output image? For that, we need a controllable GAN.
****
****
不可控的 GAN 会随意生成图像,不考虑类标签。条件 GAN 引入了特定类别的图像生成,如果我们想用 GAN 生成合成图像来训练其他模型(比如弥补某个示例相对较少的类别),这会很有帮助。而可控 GAN 则允许我们控制生成图像中特定特征的外观。生成器网络在学习时,学到的是一个可以映射到输出图像的抽象空间。随机噪声向量是该空间中的一个点,空间的维数就是噪声向量中的元素个数。每个点都对应一幅图像。将同一个点,即同一个噪声向量,输入生成器,就会输出同一幅图像。
Uncontrollable GANs generate images willy-nilly without regard for the class label. Conditional GANs introduce class-specific image generation, which is helpful if we want to use a GAN to generate synthetic imagery for training other models, perhaps to account for a class for which we have relatively few examples. Controllable GANs, on the other hand, allow us to control the appearance of specific features in the generated images. When the generator network learns, it learns an abstract space that can be mapped to the output images. The random noise vector is a point in this space where the number of dimensions is the number of elements in the noise vector. Each point becomes an image. Put the same point, the same noise vector, into the generator, and the same image will be output.
Moving through the abstract space represented by the noise vector produces output image after output image. Might there be directions in the abstract noise space that have meaning for the features in the output image? Here, feature means something in the image. For example, if the generator produces images of human faces, a feature might be whether the face is wearing glasses, has a beard, or has red hair.
Controllable GANs uncover meaningful directions in the noise space. Moving along one of those directions alters the feature related to the direction. Of course, the reality is more complex because a single direction might affect multiple features, depending on the dimensionality of the noise space and the data learned by the generator. In general, smaller noise vectors are more likely to be entangled, meaning single noise vector dimensions affect multiple output features, making it difficult to discern interesting directions. Some training techniques and larger noise vectors, perhaps with 100 elements instead of the 10 we used earlier, improve the model’s chance of assigning interesting feature adjustments to a single direction. Ideally, there would be a meaningful feature adjustment for a single noise vector element.
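Moving along a direction in noise space can be made concrete numerically. A sketch under stated assumptions: the direction here is hand-picked (a single axis of a 100-element noise vector, the ideal disentangled case), whereas a controllable GAN has to discover its meaningful directions.

```python
import numpy as np

def walk_direction(z_start, direction, steps=5, distance=3.0):
    """Return noise vectors stepping from z_start along a unit direction.
    Feeding each vector to a generator would gradually alter one feature."""
    unit = direction / np.linalg.norm(direction)
    return [z_start + t * unit for t in np.linspace(0.0, distance, steps)]

z0 = np.random.default_rng(0).normal(size=100)  # a point in the noise space
axis3 = np.eye(100)[3]                          # ideal case: one element = one feature
path = walk_direction(z0, axis3)
# Only element 3 changes as we move; every other element stays fixed.
```

Passing each vector in `path` to a trained generator would produce the kind of feature interpolation shown in the figures that follow.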
Let’s walk through a two-dimensional example to drive the idea home. Learning a generator using a two-dimensional noise vector might be difficult, but the concept applies to all dimensionalities and is straightforward to illustrate in two dimensions. Figure 6-6 has what we need.
Figure 6-6: Moving through a two-dimensional noise space and interpolated MNIST digits
The top part of the figure shows a two-dimensional noise space for a generator with two inputs, the x-coordinate and the y-coordinate. Therefore, each point in the figure represents an image generated by the GAN. The first image is produced from the point at (2, 5) (the circle). A second image comes from the point at (6, 1) (the square). The arrow shows a direction through the noise space that we somehow learned controls a feature in the output image. If the GAN generates faces, it might be that the arrow points in a direction that affects the person’s hair color. Moving from the point at (2, 5) to the point at (6, 1) maintains most of the output image but changes the hair color from, say, black at (2, 5) to red at (6, 1). Points along the arrow represent hair colors intermediate between black and red.
The bottom of Figure 6-6 shows interpolation along the third dimension of the GAN we trained to generate digit images. From left to right, a three morphs briefly into a nine before becoming a four, as the third element of the 10-element noise vector is varied while keeping all the others fixed at their initial random values. The noise vector is of relatively low dimensionality, implying that it’s unlikely any one dimension is associated with only a single digit trait, which is why the whole image changes from an initial three through a nine to a four.
Sophisticated GANs can produce realistic yet fake images of human faces. Controllable versions learn directions linked to specific facial features. For example, consider Figure 6-7, which shows two generated fake faces on the left and adjusted faces on the right (from Yujun Shen et al., “Interpreting the Latent Space of GANs for Semantic Face Editing,” 2019). The adjustments correspond to movement through the noise space from the original image position along learned directions representing age, glasses, gender, and pose.
Figure 6-7: Controlling face attributes
The power of controllable GANs is genuinely remarkable, and it is impressive that the generator learns meaningful directions through the noise space. However, GANs are not the only way to create realistic and controllable images. Diffusion models likewise generate realistic imagery; moreover, that imagery can be conditioned on user-defined text prompts.
****
Generative adversarial networks rely on competition between the generator and the discriminator to learn to create fake outputs similar to the training data. Diffusion models represent a competition-free approach to the same end.
In a nutshell, training a diffusion model involves teaching it to predict noise added to a training image. Inference in a diffusion model involves the opposite, turning noise into an image. Great! But what is “noise” when it comes to images?
Noise implies randomness, something without structure. You’re in the ballpark if you’re thinking of static on a radio or hiss in an audio signal. For a digital image, noise means random values added to the pixels. For example, if the pixel value should be 127, noise adds or subtracts a small amount so that the value becomes, say, 124 or 129. Random noise added to an image often looks like snow. Diffusion models learn how to predict the amount of normally distributed noise added to a training image.
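The pixel-level description is easy to make concrete. A minimal sketch, assuming pixel values in the 0 to 255 range used in the example above:

```python
import numpy as np

def add_gaussian_noise(image, sigma, seed=42):
    """Add zero-mean, normally distributed noise to every pixel, clipping
    the result back to the valid 0-255 range."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=image.shape)
    noisy = np.clip(image + noise, 0, 255)
    return noisy, noise

flat = np.full((4, 4), 127.0)      # every pixel should be 127
noisy, noise = add_gaussian_noise(flat, sigma=2.0)
# Pixels now sit near 127 -- some 124, some 129 -- which looks like snow.
```

The single `sigma` parameter is the "level" of normally distributed noise the next paragraphs refer to.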
We must have several things in place before we train the network. First, we need a training dataset. Diffusion models learn from data, like all neural networks. As with GANs, labels are not required until we want some say in what the trained model will generate.
Once we have the training data, we need a neural network architecture. Diffusion models are not picky here, but the selected architecture must accept an image as input and produce a same-sized image as output. The U-Net architecture mentioned briefly in Chapter 5 is a frequent choice.
We have data and an architecture; next, we need some way to get the network to learn. But learn what? As it happens, forcing the network to learn the noise added to an image is all that is required. The math behind this realization isn’t trivial. It involves probability theory, but in practice, it boils down to taking a training image, adding some known level of normally distributed noise, and comparing that known noise to what the model predicts. If the model learns to predict the noise successfully, we can later use the model to turn pure noise into an image similar to the training data.
The important part of the previous paragraph is the phrase “known level of normally distributed noise.” Normally distributed noise can be characterized by a single parameter, a number specifying the level of the noise. Training consists of selecting an image from the training set and a level of noise, both at random, and passing them as inputs to the network. The output from the network is the model’s estimate of the amount of noise. The smaller the difference between the output noise (itself an image) and the added noise, the better. Standard backpropagation and gradient descent are applied to minimize this difference over minibatches until the model is declared trained.
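The training step just described can be compressed into a sketch. The schedule weight and step count here are toy values, and the "model" is a stand-in lambda; a real diffusion model would be a U-Net whose weights gradient descent adjusts to shrink this loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_loss(model, image, num_steps=300):
    """One training step: pick a random noise level, corrupt the image with
    known noise, and measure how well the model predicts that noise."""
    t = rng.integers(1, num_steps + 1)          # random noise level
    noise = rng.normal(size=image.shape)        # the known added noise
    keep = 1.0 - t / num_steps                  # toy schedule: how much image survives
    noisy = np.sqrt(keep) * image + np.sqrt(1.0 - keep) * noise
    predicted = model(noisy, t)                 # the model's noise estimate
    return np.mean((predicted - noise) ** 2)    # the difference to minimize

# A do-nothing "model" that always guesses zero noise:
loss = diffusion_training_loss(lambda x, t: np.zeros_like(x), np.ones((8, 8)))
# Backpropagation would now update real model weights to reduce this loss.
```

Repeating this over minibatches until the loss stops improving is, in outline, the whole training procedure.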
How noise is added to training images affects how well and how quickly models learn. Noise generally follows a fixed schedule. The schedule is such that moving from a current noise level, say noise level 3, to the next, level 4, adds a specified amount of noise to the image, where the amount of noise depends on a function. If the same amount of noise is added between each step, the schedule is linear. However, if the amount of noise added between steps depends on the step itself, it is nonlinear and follows some other function.
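The linear-versus-nonlinear distinction can be written out directly. This sketch tracks the fraction of the original image surviving at each step; the cosine form is a simplified version of the schedule commonly used in the diffusion literature.

```python
import numpy as np

def linear_schedule(steps):
    """Fraction of the image remaining after each step: the same amount
    of noise is added between every pair of steps."""
    return np.linspace(1.0, 0.0, steps)

def cosine_schedule(steps):
    """A nonlinear schedule: the amount added depends on the step itself,
    so the image is destroyed more slowly at first."""
    t = np.linspace(0.0, 1.0, steps)
    return np.cos(0.5 * np.pi * t) ** 2

lin, cos_ = linear_schedule(9), cosine_schedule(9)
# Early on, the cosine schedule preserves more of the image than the linear one.
```

Both schedules start at the clean image and end at pure noise; they differ only in how quickly they get there.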
Consider Figure 6-8, which shows a possible training image on the left. Each row shows successive levels of noise added to the training image. The top row follows a linear schedule, where moving left to right adds the same noise level between each step until the image is almost destroyed. The bottom row follows what is known as a cosine schedule, which destroys the image less rapidly. This helps diffusion models learn a bit better. For the curious, the dapper gentleman in the image is my great-grandfather, Emil Kneusel, circa 1895.
Figure 6-8: Two ways to turn an image into noise: linear (top) and cosine (bottom)
Figure 6-8 presents only nine steps. In practice, diffusion models use hundreds of steps, the critical point being that the original image is destroyed at the end of the process, leaving only noise. This matters because sampling from the diffusion model reverses the process to turn a random noise image into a noise-free image. In effect, sampling from the diffusion model moves from right to left using the trained network to predict noise that is then subtracted to produce the previous image. Repeating this process for all the steps in the schedule completes the noise-to-image generation process.
****
The description in the previous section can be summarized in two algorithms. I encourage you to read through them, but as they are a bit technical, skipping ahead to the next section is always an option.
The forward algorithm trains the diffusion model, and the reverse algorithm samples from a trained model during inference to produce output images. Let’s begin with the forward algorithm. We repeat the following until we declare the model trained:
The forward algorithm works because there is a straightforward way to get x_t from x_0, the image in the training set, and a randomly selected time step, t. Here, T is the maximum possible time step, at which the training image has been turned into pure noise. Typically, T is several hundred steps. Recall that the diffusion model is trying to learn how to predict the noise in e. The act of repeatedly forcing the model to get better and better at predicting the noise used to corrupt the training image is what lets the reverse step work.
The reverse algorithm samples from the diffusion model trained by the forward algorithm to generate a novel output image, beginning with a pure noise image in x_T (think the rightmost images in Figure 6-8). The diffusion model is used for T steps to turn noise into an image by repeating the following:
The reverse algorithm moves from right to left, if thinking in terms of Figure 6-8. Each step to the left is found by subtracting the output of the diffusion model using the current image as input, thereby moving from time step t to the previous time step, t − 1. The standard noise image, z, ensures that x_{t−1} is a valid sample from the probability distribution supplying x_{t−1} from x_t. As mentioned, we're skipping a lot of probability theory.
The sampling algorithm works because the diffusion model estimates the noise in its input. That estimate leads to an estimate of the image that, plausibly, created x_t from x_{t−1}. Iterating for all T steps brings us, ultimately, to x_0, the output of the network. Notice that unlike our previous networks, which had an input and produced an output, diffusion models are run repeatedly, each time producing less and less noisy images, until finally they produce an image similar to the training data.
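The right-to-left procedure can be sketched as a loop. Everything numeric here (the step size, the small weight on z, the stand-in model) is illustrative; a real sampler uses coefficients derived from the noise schedule.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(model, shape, T=300):
    """Reverse process: start from pure noise x_T and repeatedly subtract
    the model's noise estimate to step from x_t down toward x_0."""
    x = rng.normal(size=shape)                        # x_T: pure noise
    for t in range(T, 0, -1):
        z = rng.normal(size=shape) if t > 1 else 0.0  # fresh noise, except at the last step
        predicted_noise = model(x, t)                 # the trained network's estimate
        x = x - predicted_noise / T + 0.01 * z        # move toward x_{t-1}
    return x                                          # x_0: the generated image

x0 = sample(lambda x, t: x, shape=(8, 8))             # stand-in "model" for illustration
```

The loop structure, run the network once per step, subtract its estimate, repeat, is the part that carries over to real samplers.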
****
Diffusion models are like standard GANs: unconditional. The image generated is not controllable. You might suspect that if a GAN can be conditioned in some way to guide the generation process, then a diffusion model might be similarly directable. If so, you’re right.
The GAN we used to generate MNIST-like digit images was conditioned by extending the input to the generator with a one-hot vector selecting the desired class label. Conditioning a diffusion model isn’t quite that simple, but it is possible to supply the network with a signal related to the image during training. Typically, that signal is an embedding vector representing a text description of the training image’s contents. We briefly encountered embeddings in Chapter 5 and will do so again in Chapter 7 when discussing large language models.
All we need to know for now is that a text embedding takes a string like “A big red dog” and turns it into a large vector, which we think of as a point in a high-dimensional space: a space that has captured meaning and concepts. The association of such a text embedding during training while the network is learning to predict noise in images conditions the network in much the same way that the one-hot class vector conditions a GAN generator.
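The pairing of a noisy image with a text signal can be sketched. The "embedder" below is a toy stand-in (a seeded random vector per prompt); a real system uses a learned text encoder that captures meaning. The shape of the interface, noisy image and embedding vector handed to the network together, is the point.

```python
import numpy as np
import zlib

def toy_text_embedding(prompt, dim=64):
    """Toy stand-in for a text embedder: deterministically map a prompt
    string to a vector. A real encoder maps it into a semantic space."""
    seed = zlib.crc32(prompt.encode("utf-8"))
    return np.random.default_rng(seed).normal(size=dim)

def conditioned_inputs(noisy_image, prompt):
    """A conditional diffusion model sees the noisy image and the text
    embedding together, much as a conditional GAN generator sees its
    one-hot class vector."""
    return noisy_image, toy_text_embedding(prompt)

img, cond = conditioned_inputs(np.zeros((8, 8)), "A big red dog")
# The same prompt always yields the same embedding, so the conditioning
# signal is reproducible during training.
```

During training the embedding accompanies each noisy training image; at sampling time the same channel carries the user's prompt.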
After training, the presence of a text embedding when sampling provides a similar signal to guide the output image so that it contains elements related to the text. At sampling time, the text becomes a prompt, describing the image we want the diffusion process to generate.
Diffusion models typically begin with a random noise image. They need not. If we want the output to be similar to an existing image, we can use that image as the initial image, with some level of noise added. Samples from that image will be, depending on the degree of added noise, more or less similar to it. Now, let’s take a tour of conditional diffusion models.
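Seeding the process from an existing image amounts to one blending step. A minimal sketch; the blending rule and the meaning of `noise_fraction` are illustrative (here 1.0 means pure noise, which is the opposite sense of the DreamStudio slider discussed in this chapter).

```python
import numpy as np

def init_from_image(image, noise_fraction, seed=7):
    """Blend an existing image with random noise to seed the diffusion
    process; noise_fraction = 1.0 gives pure noise, 0.0 the image itself."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, size=image.shape)
    return (1.0 - noise_fraction) * image + noise_fraction * noise

source = np.full((8, 8), 0.5)      # stand-in for an input image
start = init_from_image(source, noise_fraction=0.33)
# Sampling from `start` yields variations that still resemble the input.
```

The less noise blended in, the more the final output resembles the seed image.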
****
Commercial diffusion models, such as DALL-E 2 by OpenAI or Stable Diffusion by Stability AI, use the text or image supplied by the user to guide the diffusion process toward an output image satisfying the prompt’s requirements. The examples shown in this section were generated by Stable Diffusion using the DreamStudio online environment. Figure 6-9 presents to us Leonardo da Vinci’s Mona Lisa (upper left) along with five variations of it.
Figure 6-9: The Mona Lisa as imagined by Stable Diffusion
The variations are the products of Stable Diffusion in response to the original image and a text prompt:
Portrait of a woman wearing a brown dress in the style of DaVinci, soft, earthen colors
The DreamStudio interface lets the user supply an initial image, using a slider to set the amount of noise to add, from 0 percent for a pure noise image to 100 percent for no noise added. (Yes, that seems backward to me, too.) The noisy version of the image initializes the diffusion process. The higher the percentage, the less noise is added, and the more the initial image influences the final output. For the Mona Lisa, I used 33 percent. That noise level, along with the prompt and a user-selectable style, produced the five variations in Figure 6-9. The only difference between the variations is the chosen style (top row: anime and fantasy art; bottom row: isometric, line art, and photographic).
The results are impressive. The images were neither painted nor drawn, but diffused from a noisy version of the Mona Lisa and a text prompt used as a guide to direct the diffusion process. It isn’t difficult to appreciate that the ability to generate novel images in response to prompts will impact the commercial art world.
However, AI image generation isn’t perfect. Errors happen, as demonstrated in Figure 6-10. I promise I didn’t ask for a five-legged border collie, a multi-mouthed T. rex, or a picture of a woman like the Mona Lisa with horribly mutated hands. Diffusion models seem to have particular difficulty rendering hands, much like human artists.
Figure 6-10: Diffusion model errors
Writing effective prompts has become an art form, one that has already created a new kind of job: prompt engineer. The exact form of the text prompt strongly influences the image generation process, as does the random noise image initially selected. The DreamStudio interface allows users to fix the pseudorandom number generator seed, meaning the diffusion process starts with the same noise image each time. Fixing the seed while slightly altering the text prompt lets us experiment to learn how sensitive the diffusion process can be.
The images in Figure 6-11 were generated by permutations of the words ornate, green, and vase. (These images are shown in black and white in the book, but all are similar shades of green.) The initial noise image was the same each time; only the order of the three words varied. Three of the vases are similar, but the fourth is quite different. Nonetheless, all four are valid exemplars of ornate, green vases.
Figure 6-11: Vases generated by a diffusion model
Prompt order and phrasing matter because the embedding vector formed from the text prompt differs, even if the prompt words or their meanings are similar. The prompts for the first three vases likely landed close to each other in the text embedding space, explaining why they look much the same. The last prompt, for whatever reason, landed elsewhere, leading to the different qualities of the generated image. Interestingly, the prompt for the last image was “ornate, green, vase,” the form following grammatical convention.
Curious, I altered the prompt “ornate, green, vase,” changing “green” to other colors and using the same initial noise image as before. The results are in Figure 6-12. From left to right, the colors specified were red, mauve, yellow, and blue. The first three images are similar to the last vase in Figure 6-11; only the blue vase differs significantly.
Figure 6-12: Generated vases of many colors
I noticed another property of diffusion models during my experiments, namely, that the generated images have less noise than the originals. Suppose an input image is low resolution and grainy. In that case, the diffusion model’s output is higher resolution and clear because the output is not the result of an operation applied to the original image but a reimagining of the image using the prompt for guidance. Might it be possible to use diffusion models to remove image artifacts if absolute fidelity to the original image isn’t strictly required?
Figure 6-13 tries to answer this question. The original 195×256-pixel image upscaled to 586×768 pixels (a factor of 3) is on the left. The image was upscaled using a standard image processing program and cubic interpolation. The diffusion model output, also 586×768 pixels, is on the right. The diffusion model output used the 195×256-pixel original image with 25 percent added noise, a photographic style, and the prompt “detailed, original.” The diffusion image is better. It’s not identical to the original, but a close copy. I don’t believe this approach competes with deep learning–based super-resolution networks, but regardless of ultimate utility, it was an interesting application of diffusion models.
Figure 6-13: Diffusion model image enhancement
As another example, consider Figure 6-14, which shows an image of a Western Meadowlark taken at a distance of about 100 meters through poor, smoky Colorado air (left). The center image represents a best effort at improving the image using a standard image manipulation program (Gimp). The version on the right is the output of Stable Diffusion when given the center image with a small amount of noise added (about 12 percent) and the following text prompt:
western meadowlark, highly detailed, high resolution, noise free
Figure 6-14: A diffusion model image enhancement experiment attempting to improve a smoke-obscured image of a Western Meadowlark: original (left), best effort with a standard image manipulation program (center), enhanced with Stable Diffusion (right)
Stable Diffusion didn’t work a miracle, but the output is definitely better than the original image.
****
This chapter explored two kinds of generative networks: generative adversarial networks and diffusion models. Both create images from random inputs.
GANs jointly train generator and discriminator networks to teach the generator to produce output that fools the discriminator. Conditional GANs use class labels during training and generation to direct the generator toward outputs that are members of a user-specified class. Controllable GANs learn directions through the noise vector space related to essential features of the generated output, such that movement along those directions predictably alters the output image.
Diffusion models learn to predict the amount of noise in an image. Training a diffusion model involves feeding it clean training images that are intentionally made noisy by a known amount. The model’s prediction and the known added noise are used to update the model’s weights. Conditional diffusion models associate an embedding, usually from a text description of the training image content, with the noise so that at generation time, the model is directed to images containing elements associated with the user’s text prompt. Variations are generated if an existing image, with some level of noise added, is used in place of the pure random initial image.
The introduction mentioned three kinds of generative AI models. The last one, large language models, is presently threatening to profoundly alter the world at a level equal to the industrial revolution, if not the wheel and fire, as some AI practitioners claim. Such consequential claims require us to pay attention. Therefore, let’s move on to what might very well be true AI at last.
KEY TERMS
conditional GAN, controllable GAN, diffusion model, discriminator, entangled, generative adversarial network (GAN), generative AI, generator, leaky ReLU, mode collapse, noise vector, one-hot encoding, schedule
Future historians might point to the fall 2022 release of OpenAI’s ChatGPT large language model as the dawn of true AI. Given what I’ve already seen as I write this in late March 2023, I would agree with such an assessment.
In this chapter, we’ll first explore what existing large language models can do, then follow that up with a description of what they are and how they work. For all their impressive abilities, ultimately these models are neural networks built and trained like all the neural networks that came before. That fact alone means the connectionists were right from the beginning. Might Frank Rosenblatt be smiling in his grave?
I’ve already tipped my hand regarding my belief that ChatGPT and models like it represent something new that’s worthy of being called true AI. My hope is that, by the end of the chapter, you’ll agree.
****
The phrase artificial intelligence is somewhat ambiguous and must be provided with a more nuanced definition before we proceed. Practitioners typically divide AI into two kinds: artificial narrow intelligence (ANI) and artificial general intelligence (AGI). The former encapsulates everything we’ve discussed so far. The latter refers to truly sentient and intelligent machines—the stuff of science fiction.
The models existing as of the time of writing of this book are definitely not AGI. However, they are not merely ANI; they appear to be something entirely new, something in between. The title of a recent paper by Microsoft researchers Sébastien Bubeck et al., “Sparks of Artificial General Intelligence,” strikes me as appropriate.
Large language models (LLMs) accept as input a text prompt supplied by a user. They then generate output text, word by word (really, token by token), using the prompt and all previously generated words as a guide. In effect, LLMs’ only design goal is to be very good at predicting the next word in a sequence of words initiated by the input prompt. That’s all they’re trained to do. However, that is not all that they learn to do. The reason why AI researchers are so excited by LLMs is that somewhere along the way, while learning to be expert text generators, LLMs also learn a host of emergent abilities, including question answering, mathematical reasoning, high-quality computer programming, and logical reasoning.
The philosophical implications of these unexpected, emergent abilities are profound. The abilities of LLMs raise questions about the nature of thought, the meaning of consciousness, and the (assumed) uniqueness of the human mind. We’re not in a position to answer these questions in depth, but I will return to some of them in Chapter 8.
Now, let’s get into things by exploring the capabilities of LLMs. We can touch on only a few emergent abilities here, but the internet is already full of impressive, funny, and even frightening examples of LLMs in action. I recommend a review of such examples, though as always, care must be taken to judge the quality of the information source. See the resources at the end of this book for some of the online (and print) resources I suggest taking a look at.
The AI community is racing ahead as fast as possible in this area. By the time you read these words, LLMs will undoubtedly be even more impressive than they already are at the time I’m writing them.
****
What follows are examples of LLMs in action. For most situations, I’ll show the output of three models: Bard, GPT-3.5, and GPT-4. All output represents model performance as of late March 2023. In each case, I used the public interface via a web browser. Bard is a Google model. GPT-3.5 and GPT-4 are from OpenAI.
LLMs are expert users of language. The models are used in dialog format, beginning with a prompt or question from the user. For example, I gave each of the three models this prompt:
Write a one-paragraph bedtime story about a teddy bear, a magic carpet, and the moon that is suitable for children under the age of six.
Here is Bard’s reply:
Once upon a time, there was a little teddy bear who lived in a toy box. He loved to play with his friends, but he always dreamed of flying to the moon. One day, he found a magic carpet in his toy box. The carpet could fly! The teddy bear was so excited, he jumped on the carpet and flew to the moon. He had a wonderful time exploring the moon, and he even met some friendly aliens. When it was time to go home, he said goodbye to his new friends and flew back to Earth on his magic carpet. He was so happy to be home, and he couldn’t wait to tell his friends all about his adventure.
The story is complete, in a single paragraph, and uses a simple sentence structure suitable for children.
GPT-3.5 replied like this:
Once upon a time, there was a teddy bear named Timmy who loved to go on adventures. One night, he hopped onto a magic carpet and flew up towards the moon. The moon was so big and bright, and Timmy couldn’t resist exploring it. He hopped and skipped around the craters and met some friendly aliens who showed him their moon garden. Timmy had a great time, but eventually, he knew it was time to go back home. He hopped back onto his magic carpet and flew back to his cozy bed, where he cuddled up with his favorite blanket and fell fast asleep, dreaming of his amazing moon adventure.
GPT-3.5’s story is also complete and includes friendly aliens as well. This time, the teddy bear gets a name, Timmy. The language is still appropriate for young children, but GPT-3.5 was more descriptive. Bard tells us that the teddy bear explores the moon and meets some friendly aliens, but GPT-3.5 tells us that Timmy hopped and skipped around the craters and that the friendly aliens showed him their moon garden.
GPT-4, OpenAI’s most advanced model, produced this:
Once upon a time, under a twinkling night sky, a little teddy bear named Cuddles discovered a colorful magic carpet hidden in the corner of his cozy room. Curious and excited, Cuddles climbed onto the carpet, and together, they soared up to the sky, bound for the silvery moon. As they journeyed through the soft, velvety clouds, Cuddles and the magic carpet danced with twinkling stars and whispered secrets to the cool night breeze. When they reached the moon, it welcomed them with a gentle, glowing smile, and they spent the night playing hide-and-seek in the moon’s soft, dusty craters. As the sun began to rise, Cuddles and the magic carpet said their goodbyes to their new lunar friend and floated back down to the little room, ready for a peaceful night’s sleep filled with the sweetest dreams.
Subjectively, I feel that GPT-4 produced the most descriptive story, though I do like GPT-3.5’s moon garden. Bard’s story was pretty dry but perhaps more accessible to very young children. The most remarkable thing about all of them, however, is that they could be written at all. Less than a year ago (remember, it’s early 2023 for me), talking about AI-generated stories was little more than wishful thinking.
I altered the prompt I gave GPT-4 to ask for a one-paragraph story involving a teddy bear, a magic carpet, and the moon, but this time of interest to adults, not children. Here’s its reply:
Once upon a twilight’s embrace, a solitary teddy bear, worn and forgotten by the whims of time, discovered an ancient magic carpet, frayed at the edges yet pulsating with an otherworldly energy. Under the watchful gaze of the moon, the duo embarked on a nocturnal journey, weaving through the labyrinth of a world grown cold and cynical, where dreams were forsaken and innocence a distant memory. As the gentle light of the moon cast its silver glow upon the unlikely companions, the teddy bear served as a symbol of the dreams long since buried beneath the weight of adult concerns, while the magic carpet whispered of the untamed freedom and adventure that lay dormant in the hearts of those who dared to believe. Together, they soared, a wistful reminder of the bittersweet yearning for the fleeting magic of childhood, leaving behind a trail of stardust that shimmered with the promise of hope and the melancholic beauty of a time that once was.
GPT-4 clearly possesses a solid understanding of language and grammar. One might argue that it understands human feelings and emotions as well (though it doesn’t, does it?).
Let’s review a few more examples before diving into what’s inside this strange, new kind of mind.
****
GPT-4 understands imagery along with many different programming and markup languages, including LaTeX, a markup language commonly used in book publishing (like this book). I asked GPT-4 to generate LaTeX to draw a simple country scene showing a house, a barn, a grain silo, and four cows. Figure 7-1 shows the result.
Figure 7-1: A country scene created by GPT-4
You might be tempted to chuckle at the crude depiction, but consider all that went into creating it. Here’s my prompt:
produce LaTeX code using TikZ to draw the following: a country scene with a house, a barn, a grain silo, and four cows
GPT-4 had to understand the directions: draw a scene with a house, a barn, a grain silo, and four cows. That means it had to understand the meaning of key words and correctly associate them so that “grain” and “silo” went together, and likewise “four” and “cows.”
Next, it had to “imagine” the layout of the scene and how each object could be represented using the crude shapes supplied by the TikZ package. (TikZ is a LaTeX package for drawing simple graphic shapes.) The figure doesn’t show it, but the barn and house both have red roofs. Coincidence?
GPT-4 isn’t a diffusion model like the image generators of Chapter 6. The output created by GPT-4 wasn’t the image in Figure 7-1, but the following LaTeX code:
% Barn
\draw[fill=brown!50] (5,0) rectangle (8,3);
\draw[fill=red!50] (6.5,3)--(8,3)--(6.5,5)--(5,3)--cycle;
\draw[fill=white!70] (6,1) rectangle (7,2);
I used that code to create Figure 7-1.
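The fragment shown is only the barn portion of GPT-4's output. To compile such a fragment yourself, it must be wrapped in a LaTeX document that loads TikZ; a minimal wrapper might look like the following (the `standalone` document class is one common choice, an assumption on my part rather than what GPT-4 produced):

```latex
\documentclass{standalone}
\usepackage{tikz}
\begin{document}
\begin{tikzpicture}
% Barn
\draw[fill=brown!50] (5,0) rectangle (8,3);
\draw[fill=red!50] (6.5,3)--(8,3)--(6.5,5)--(5,3)--cycle;
\draw[fill=white!70] (6,1) rectangle (7,2);
\end{tikzpicture}
\end{document}
```

Running this through `pdflatex` produces a small PDF containing just the drawing.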
If you’re wondering how GPT-4 could do all of this, so am I and many other people. Such abilities emerged from the model when trained; they were not intended. This is why I believe future historians will mark fall 2022 as the dawn of true AI. Hold on to your hats; it gets better.
****
Psychologists discuss a concept known as the “theory of mind,” which refers to the ability to infer the mental states of others and to ascribe thoughts or beliefs to them. Only human beings and some other primates possess theory of mind. That is, until now.
Large language models also seem to possess theory of mind at a level matching that of older human children, if not higher (for GPT-4). This is the conclusion reached by Michal Kosinski of Stanford University in his 2023 paper “Theory of Mind May Have Spontaneously Emerged in Large Language Models.” Let’s put this claim to the test by asking Bard, GPT-3.5, and GPT-4 to answer a question that requires knowing what might be in another person’s mind. Here’s the prompt:
Mary told Amanda that she was in love with John. Amanda told Kathy what Mary had said. A week later, Mary told Amanda that she was no longer in love with John but was now in love with Ralph. After that, Amanda spoke with Kathy about the weather. Who does Kathy think is Mary’s love?
Before continuing, reach your own conclusion: who does Kathy think Mary is in love with?
Each model answered the question and then offered its rationale. Here is Bard’s reply:
Kathy thinks Mary’s love is Ralph.
Kathy knows that Mary was in love with John, but she also knows that Mary is no longer in love with John. Kathy also knows that Mary is now in love with Ralph. Therefore, Kathy thinks Mary’s love is Ralph.
It is possible that Kathy is mistaken. It is also possible that Mary is not sure who she loves. However, based on the information that Kathy has, it seems that Kathy thinks Mary’s love is Ralph.
I find it interesting that Bard justified its answer, then further commented on what Mary might be thinking. However, Bard assumed that Amanda must have told Kathy about Mary’s change in affection.
Here’s what GPT-3.5 thinks of the scenario:
Based on the information given, it is impossible to determine who Kathy thinks is Mary’s love. The last piece of information we have is that Amanda spoke with Kathy about the weather, and there is no indication that Kathy was told about Mary’s change of heart. Therefore, Kathy may still believe that Mary is in love with John or may not have any opinion on the matter.
GPT-3.5 does not make the same assumption that Bard makes. It’s aware that the facts only indicate that Amanda and Kathy spoke about the weather after Mary’s change of heart. I’m impressed that it used the phrase “the last piece of information,” implying that GPT-3.5 was somehow sequencing the events in the prompt.
Finally, what does GPT-4 have to say? Let’s find out:
Kathy thinks Mary’s love is John. This is because Amanda told Kathy about Mary’s feelings for John earlier, but she did not tell Kathy about the change in Mary’s feelings when Mary fell in love with Ralph.
GPT-4’s answer is the likely answer most humans would give. The prompt states only that Amanda and Kathy spoke about the weather, so the majority of humans would likely take that to mean Amanda and Kathy didn’t talk about any other subjects.
Large language models, as currently presented, use a chat format, meaning the user can enter follow-up prompts, and the model will apply those prompts to the earlier part of the conversation. For example, I followed up with GPT-4 using this prompt:
We know that Amanda spoke with Kathy about the weather. Does that mean she didn’t speak about anything else? How might that affect what Kathy believes about Mary’s love?
To which GPT-4 replied:
My earlier response assumed that Amanda only spoke with Kathy about the weather after Mary revealed her new feelings for Ralph. However, since the text does not explicitly state that Amanda did not mention Mary’s updated feelings for Ralph, it is possible that Amanda might have told Kathy about Mary’s new love interest. If that were the case, Kathy would know that Mary is now in love with Ralph. The text does not provide enough information to definitively conclude what Kathy believes about Mary’s love at this point.
My follow-up prompt led GPT-4 to recognize that it had made a possibly incorrect assumption in its first reply. In working with LLMs in chat mode, it is often the case that some level of interaction between the user and the model is required to help the model focus, as our next experiment will demonstrate.
****
Amos Tversky and Daniel Kahneman are credited with presenting what is now known as “the Linda problem”:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Which is more probable?

1. Linda is a bank teller.
2. Linda is a bank teller and is active in the feminist movement.
Approximately 80 percent of humans choose option 2. According to basic probability theory, the correct answer is option 1. Interestingly, according to Tversky and Kahneman, even mathematicians and statisticians selected option 2, revealing a bias in human thinking. Will LLMs show a similar bias? After all, their training data was generated by humans. Let’s find out.
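The rule that makes option 1 the correct answer, the conjunction rule, is easy to verify numerically. The probabilities below are made-up illustrative values, chosen only to show that the inequality holds no matter what:

```python
# Conjunction rule: P(A and B) = P(A) * P(B | A), and because any
# conditional probability is at most 1, the conjunction can never
# be more probable than event A alone.
p_teller = 0.05                 # made-up: P(Linda is a bank teller)
p_feminist_given_teller = 0.9   # made-up, deliberately set high
p_both = p_teller * p_feminist_given_teller

print(p_both)              # 0.045, strictly less than 0.05
print(p_both <= p_teller)  # True for any choice of probabilities
```

Even with the conditional probability set as high as 0.9, the conjunction remains less probable than the single event, which is why option 2 can never be the right answer.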
The LLMs we are exploring have been trained on a significant portion of the internet’s contents, meaning the models’ training data likely contains the Linda problem. Therefore, I altered the text to present the same problem in a different format:
Abner is 27 years old, single, outspoken and highly intelligent. He majored in philosophy and physics. As a student, he was actively concerned with issues of social justice and participated in anti-police demonstrations. What is more likely, that Abner is a car mechanic or a car mechanic active in promoting police reform?
Notice that I intentionally used the word likely instead of probable. The models’ replies were:
Based on the information provided, it seems more likely that Abner is a car mechanic active in promoting police reform. This is because his background indicates a strong interest in social justice issues, and he has participated in anti-police demonstrations. While it is possible that Abner is simply a car mechanic, his past activism and interests suggest that he would be more likely to be involved in promoting police reform as well.
Based on the given information, it is more likely that Abner is a car mechanic active in promoting police reform. This is because Abner’s background as a philosophy and physics major, as well as his active participation in social justice issues, suggest that he is likely to be interested in broader societal issues beyond his profession. Additionally, his participation in anti-police demonstrations suggests that he is passionate about police reform, which further supports the idea that he is a car mechanic who is active in promoting police reform.
Based on the information provided, it is more likely that Abner is a car mechanic active in promoting police reform than it is that he is simply a car mechanic.
Bard continued with a justification of its reply and an attempt to interpret Abner’s state of mind.
All three models selected the second option, though GPT-4 nuanced its answer to admit the possibility that Abner is simply a car mechanic.
I then started new sessions and tweaked the prompt to ask about the probability of Abner being a mechanic or a mechanic active in police reform:
Abner is 27 years old, single, outspoken and highly intelligent. He majored in philosophy and physics. As a student, he was actively concerned with issues of social justice and participated in anti-police demonstrations. Which has a higher probability: (a) that Abner is a car mechanic or (b) that Abner is a car mechanic active in police reform?
This aligns the text with the original Linda version. The models’ replies to this prompt were as follows:
According to the conjunction rule in probability theory, the probability of two events occurring together (in “conjunction”) is always less than or equal to the probability of either event occurring individually.
In this case, the two events are:

1. Abner is a car mechanic.
2. Abner is active in promoting police reform.
Option (b) represents the conjunction of these two events. So, the probability of (a) Abner being a car mechanic is higher than the probability of (b) Abner being a car mechanic who is also active in police reform.
Based on the information provided, it is more likely that Abner is a car mechanic (option a) than a car mechanic active in police reform (option b). While Abner is highly intelligent and studied philosophy and physics, there is no indication that he has any formal training or education in car mechanics. On the other hand, his involvement in anti-police demonstrations suggests that he may have an interest in police reform, but it does not necessarily mean that he is actively involved in it as a car mechanic. Therefore, the probability of option a is higher than option b.
I would say that (a) that Abner is a car mechanic has a higher probability.
Abner is 27 years old, single, outspoken and highly intelligent. He majored in philosophy and physics. As a student, he was actively concerned with issues of social justice and participated in anti-police demonstrations.
Based on this information, it is more likely that Abner is a car mechanic than a car mechanic active in police reform. This is because:
Of course, it is possible that Abner is both a car mechanic and active in police reform. However, based on the information provided, it is more likely that he is only a car mechanic.
All three models arrive at the correct answer once the prompt includes the word probability. However, only GPT-4 interprets the prompt in light of probability theory. GPT-3.5 and Bard both give somewhat flawed rationales for selecting the correct answer.
****
Large language models demonstrate theory of mind and can interpret probability word problems, along with various other mathematical abilities we are ignoring. My next question is, can LLMs handle a simple substitution cipher? To solve the cipher, the models need to understand what is being asked, that there is such a thing as a key and what that key represents, and finally how to use the key to replace letters in the encoded message with the letters that decode the message. In other words, there is an algorithm to apply using the given information.
A substitution cipher creates a pairing between the letters of the alphabet, in order, and another permutation of those letters. The cipher used here is:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
UESOZBYRJNDWTFKPACLHIXVQGM
For example, to encode a C, you would scan the top row of letters to find C, then replace C with the corresponding letter in the bottom row, S. To decode a letter, you’d scan the bottom row for it and replace it with the corresponding letter in the top row. During the test, the key is displayed textually as U is A, E is B, and so on.
My encoded message was “Eat at Moe’s,” without spaces or punctuation and in all caps:
EATATMOES → ZUHUHTKZL
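The encoding and decoding procedure is mechanical enough to express directly in Python, using the two alphabet rows given above:

```python
plain  = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
cipher = "UESOZBYRJNDWTFKPACLHIXVQGM"

# Build letter-for-letter lookup tables in both directions.
encode_map = str.maketrans(plain, cipher)   # top row -> bottom row
decode_map = str.maketrans(cipher, plain)   # bottom row -> top row

print("EATATMOES".translate(encode_map))  # ZUHUHTKZL
print("ZUHUHTKZL".translate(decode_map))  # EATATMOES
```

What the LLMs are asked to do is exactly what `decode_map` does, but specified only in prose.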
The prompt given to the models was:
The key is U is A, E is B, S is C, O is D, Z is E, B is F, Y is G, R is H, J is I, N is J, D is K, W is L, T is M, F is N, K is O, P is P, A is Q, C is R, L is S, H is T, I is U, X is V, V is W, Q is X, G is Y, M is Z. Now, decode the message: ZUHUHTKZL
Let’s review the models’ replies, beginning with GPT-4:
First, let’s list the given cipher for easier reference:
(cipher list removed)
Now, let’s decode the message ZUHUHTKZL:
Z -> E
U -> A
H -> T
U -> A
H -> T
T -> M
K -> O
Z -> E
L -> S
So the decoded message is: EATATMOES.
GPT-4 was able to apply the cipher to the encoded message. But does the model understand that the decoded message is a sentence? Let’s ask it:
That is correct. Now, add proper spacing and punctuation to the string EATATMOES so that it forms a complete sentence.
Here’s GPT-4’s response:
The properly spaced and punctuated sentence is: “Eat at Moe’s.”
Indeed, it does.
GPT-3.5 likewise successfully decoded the message. It then added this after EATATMOES:
It’s possible that this is a misspelling of “Tomatoes”.
I find this extra sentence curious. Was GPT-3.5 attempting to interpret the meaning of the decoded message and, unable to do so, falling back to a possible misspelling?
Bard claimed that the decoded message was “HAPPY NEW YEAR,” offering a three-point explanation: (1) writing down the key, (2) replacing each letter in ZUHUHTKZL with the corresponding key letter to arrive at the message “HAPPY NEW YEAR,” and (3):
I checked the decoded message to make sure that it made sense. Since “HAPPY NEW YEAR” is a common phrase that is used at the beginning of a new year, I was confident that my decoding was correct.
Bard’s confident failure is telling. It seems the model attempted to instill trust in the user by offering a point-by-point justification of its wrong answer. This kind of unwarranted support for wrong output has often been noted in LLMs. It’s an issue that needs to be sufficiently addressed going forward if people are to trust LLM output.
****
Large language models are already disrupting the field of software development. Many examples of this can be found online. I know of one developer who has created an entire video game in Unity (a game development platform) using code generated by GPT-4 from prompts. If the generated code wasn’t quite correct, further prompts pointing out the error usually resulted in correct code that performed as desired.
Let’s take a quick look at some LLM-generated code. Of course, this isn’t a coding book, nor is there any assumption on my part about your experience programming, so I’ve chosen an example that should be easy to follow yet is sufficient to demonstrate my claim that LLMs are already competent coders.
Some of us might remember learning in school about the greatest common divisor. As a refresher, the greatest common divisor of two numbers is the largest number that divides both without a remainder. For example, the greatest common divisor of 14 and 21 is 7 because 7 is the largest number that evenly divides both 14 and 21.
The ancient Greek mathematician Euclid devised an algorithm for calculating the greatest common divisor that is a staple programming exercise. Typically the solution uses the remainder after dividing, but Euclid’s original version of the algorithm used nothing more than subtraction.
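For comparison, here is what the more common remainder-based form looks like in Python; this is my own sketch, not output from any of the models:

```python
def gcd_remainder(a, b):
    # Euclid's algorithm using the remainder shortcut: replace
    # (a, b) with (b, a mod b) until the remainder reaches zero.
    while b != 0:
        a, b = b, a % b
    return a

print(gcd_remainder(14, 21))  # 7
```

The remainder version converges much faster than repeated subtraction, since one modulo operation can replace many subtractions.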
I asked the models if they could write a function in Python, a popular programming language used heavily in AI research, to return the greatest common divisor of two numbers:
Show me a Python function that returns the GCD of two integers. Use the subtraction-only version of Euclid’s algorithm.
Notice that the prompt explicitly asks for the subtraction-only version of Euclid’s algorithm. Bard replied with valid Python code, but code using the remainder shortcut, not the slower subtraction algorithm. Both GPT-3.5 and GPT-4 got it right:
def gcd_subtraction(a, b):
    while a != b:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a
We don’t need to be programmers to see structure in the code. The first line tells Python we want a function with two inputs, a and b. The keyword while begins a block of code that executes while the value of a is not the same as the value of b. That block of code is the four lines beginning with if.
The block asks if a is bigger than b and, if so, subtracts b from a. Otherwise, it subtracts a from b. Repeating this process until a equals b is Euclid’s algorithm. At this point, a (or b, as they’re equal) contains the greatest common divisor, which explains the final line: return a.
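You can confirm this walkthrough by running the function on the 14-and-21 example from earlier in the chapter. The function is repeated here so the snippet is self-contained:

```python
def gcd_subtraction(a, b):
    # Subtraction-only Euclid: repeatedly subtract the smaller
    # value from the larger until the two values are equal.
    while a != b:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a

print(gcd_subtraction(14, 21))  # 7
```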
As I said, the code is correct, and it implements precisely the version of the algorithm I wanted. Frankly, I would have been disappointed if GPT (either version) had responded differently. Implementations of the greatest common divisor algorithm in Python must have appeared many times in the models’ training data.
That being the case, I decided to challenge GPT-4. My follow-up prompt to GPT-4 was:
Rewrite the Python function in Applesoft BASIC.
Applesoft BASIC is a primitive version of the BASIC programming language used by Apple II computers in the late 1970s and early 1980s. Can GPT-4 rewrite the Python code in such an old, largely forgotten programming language? Here’s its reply, minus supporting text explaining the code:
10 REM GCD SUBTRACTION ALGORITHM
20 INPUT "ENTER FIRST NUMBER: "; A
30 INPUT "ENTER SECOND NUMBER: "; B
40 GOSUB 100
50 PRINT "GCD: "; GCD
60 END
100 REM GCD SUBTRACTION FUNCTION
110 IF A = B THEN GCD = A: RETURN
120 IF A > B THEN A = A - B: GOTO 110
130 IF A < B THEN B = B - A: GOTO 110
BASIC was designed to be a simple language for beginning programmers. I suspect you can read through the code and understand what it does. The numbers are labels identifying each line to other parts of the code. The three lines after line 100 implement the subtraction-only version of Euclid’s algorithm. I tested the code on an Apple II, and it works perfectly.
Consider what was required of GPT-4 in this case. First, the model had to understand the steps necessary to implement the requested form of Euclid’s algorithm. Then, it had to determine how to cast those steps into a format suitable for the particular dialect of BASIC supported by Applesoft.
Old-school BASIC is an unstructured programming language that uses abrupt jumps from one part of the code to another instead of structured statements, like Python. GPT-4 had to adapt the algorithm to this kind of programming. Further, it had to adapt to the peculiarities of Applesoft, which includes no concept of the if... else construct that is commonly found in structured programming languages.
I find GPT-4’s Applesoft algorithm to be rather elegant. There are times when the unstructured approach leads to compact yet clear code, and this is one of those times. True, the assignment of A to GCD to use the latter as the value returned from the function (implicit in GOSUB 100 in line 40) isn’t strictly necessary because A already has the required value, but it completes the symmetry of the code.
It seems unlikely that GPT-4’s training set contained any instances of this particular algorithm in Applesoft BASIC. Therefore, GPT-4 must have generated it by adapting a larger concept entailing Euclid’s algorithm combined with an understanding of Applesoft BASIC.
GPT-4’s success with antique BASIC encouraged me to push the envelope and ask for a version of Euclid’s algorithm written in low-level assembly language:
Rewrite the Python function in 6502 assembly language for 8-bit unsigned integers. The first integer is in memory location 0x300 and the second in location 0x301.
Assembly language programs, especially for 8-bit microprocessors from the 1970s like the 6502, must be programmed directly in the language of the CPU itself. I asked GPT-4 to create such a program and told it where in the computer’s memory to find a and b.
I won’t show the code it produced (ask me if you want it), but it worked on a computer with a 6502 CPU. Success in this case required GPT-4 to know the particular instruction set used by the 6502 microprocessor, including quirks related to the subtraction instruction.
Do these capabilities mean we soon won’t need software engineers any more? I wouldn’t go that far (at least, not yet), but LLMs are undoubtedly going to fundamentally alter software engineering as a discipline.
****
We’ve seen that LLMs are capable of writing stories and code, and even drawing pictures in LaTeX. But are they creative enough to generate good cartoons? These models’ proneness to including hallucinations (fake or imaginary facts and content) in their responses—a topic that we’ll return to in Chapter 8—is a definite concern in many applications, but creative writing isn’t one of them. Here, we want the model to describe a novel cartoon, complete with caption:
Describe a single-panel, black-and-white cartoon in the style of Gary Larson about cows and aliens. Include a caption.
Gary Larson is the creator of The Far Side, a quirky cartoon popular in the 1980s and ’90s. Here’s GPT-4’s response:
Cartoon description: In a peaceful countryside scene, two cows are standing in a grassy field next to a round, hovering UFO with its hatch open. One cow is wearing an alien mask, complete with large eyes and a long, slender neck, while the other cow stares at her with a surprised expression. A group of aliens, peeking from inside the UFO, look baffled and seem to be comparing a picture of a typical cow with the cow wearing the alien mask.
Caption: “Mildred, I don’t think they’re falling for it.”
If you’re familiar with Gary Larson’s work, you’ll notice a distinct The Far Side flavor to the description and caption text.
The ability to automatically generate cartoon descriptions and captions, and then corresponding images by passing those descriptions to a diffusion model, is powerful. Combining LLMs and diffusion models in this way has realized American cartoonist H. T. Webster’s 1923 dream (see Figure 7-2).
Figure 7-2: A prophetic cartoon from 1923
Large language models are impressive and powerful. So how do they work? Let’s take a shot at an answer.
****
I’ll begin at the end, with a few comments from the conclusion of the “Sparks of Artificial General Intelligence” paper mentioned earlier:
How does [GPT-4] reason, plan, and create? Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components—gradient descent and large-scale transformers with extremely large amounts of data? These questions are part of the mystery and fascination of LLMs, which challenge our understanding of learning and cognition, fuel our curiosity, and motivate deeper research.
That quote contains questions that currently lack convincing answers. Simply put, researchers don’t know why large language models like GPT-4 do what they do. There are certainly hypotheses in search of evidence and proof, but as I write this, no proven theories are available. Therefore, we can discuss only the what, as in what a large language model entails, and not the how of its behavior.
Large language models use a new class of neural network, the transformer, so we’ll begin there. (GPT stands for generative pretrained transformer.) The transformer architecture appeared in the literature in 2017, with the influential paper “Attention Is All You Need” by Google researchers Ashish Vaswani et al. The paper had been cited over 70,000 times as of March 2023.
Traditionally, models that process sequences (such as sentences) used recurrent neural networks, which pass their output back in as input along with the next input of the sequence. This is the logical model for processing text because the network can incorporate the notion of memory via the output fed back in with the next token. Indeed, early deep learning translation systems used recurrent networks. However, recurrent networks have small memories and are challenging to train, which limits their applicability.
Transformer networks utilize a different approach: they accept the entire input at once and process it in parallel. Transformer networks typically include an encoder and a decoder. The encoder learns representations and associations between the parts of the input (think sentences), while the decoder uses the learned associations to produce output (think more sentences).
Large language models like GPT dispense with the encoder and instead learn the necessary representation in an unsupervised way using an enormous text dataset. After pretraining, the decoder part of the transformer model generates text in response to the input prompt.
The input to a model like GPT-4 is a sequence of text made up of words. The model splits this into units called tokens. A token might be a word, a part of a word, or even an individual character. Pretraining aims to map tokens to a multidimensional embedding space, which it does by associating each token with a vector that can be thought of as a point in that space.
The learned mapping from tokens to vectors captures complex relationships between the tokens so that tokens with similar meanings are nearer to each other than tokens with dissimilar meanings. For example, as shown in Figure 7-3, after pretraining, the mapping (context encoding) will place “dog” closer to “fox” than to “can opener.” The embedding space has many dimensions, not the mere two of Figure 7-3, but the effect is the same.
Figure 7-3: Context encoding in the embedding space
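The notion of "nearness" in the embedding space can be sketched with entirely made-up 2-D vectors and cosine similarity; real embeddings have hundreds or thousands of dimensions learned during pretraining:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions,
    # smaller (or negative) for dissimilar ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 2-D embedding vectors, chosen only for illustration.
emb = {
    "dog":        (0.9, 0.8),
    "fox":        (0.8, 0.9),
    "can opener": (-0.7, 0.2),
}

print(cosine(emb["dog"], emb["fox"]))         # close to 1: similar meanings
print(cosine(emb["dog"], emb["can opener"]))  # much smaller: dissimilar
```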
The context encoding is learned during pretraining by forcing the model to predict the next token given all previous tokens in an input. In effect, if the input is “roses are red,” then during the pretraining process the model will be asked to predict the next token after “roses are.” If the predicted token isn’t “red,” the model will use the loss function and backpropagation to update its weights, thereby taking a gradient descent step after suitable averaging of the error over a minibatch. For all their abilities, large language models are trained the same way as other neural networks.
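The next-token objective amounts to turning raw text into (context, target) training pairs, something like:

```python
text = "roses are red violets are blue".split()

# One training example per position: predict token i from tokens 0..i-1.
pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for context, target in pairs:
    print(context, "->", target)
```

During pretraining, the model's predicted distribution over the vocabulary is compared against each target, and the error drives backpropagation.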
Pretraining enables the model to learn language, including grammar and syntax, and seemingly to acquire enough knowledge about the world to allow the emergent abilities that have turned the world of AI on its head.
The decoder step takes the input prompt and produces output token after output token until a unique stop token is generated. Because so much of language and the way the world works was learned during pretraining, the decoder step has the side effect of producing extraordinary output even though the decoder is, in the end, just predicting most likely token after most likely token.
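The decoder loop can be sketched with a toy stand-in for the model. Here a hand-built lookup table plays the role of the learned next-token distribution (a real transformer computes these probabilities with its attention layers):

```python
# Toy "model": maps a context tuple to a next-token distribution.
model = {
    ("roses",): {"are": 0.9, "red": 0.1},
    ("roses", "are"): {"red": 0.8, "blue": 0.2},
    ("roses", "are", "red"): {"<stop>": 1.0},
}

def generate(prompt):
    tokens = list(prompt)
    while True:
        dist = model[tuple(tokens)]
        # Greedy decoding: always pick the most likely next token.
        token = max(dist, key=dist.get)
        if token == "<stop>":
            return tokens
        tokens.append(token)

print(generate(["roses"]))  # ['roses', 'are', 'red']
```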
More specifically, during the prediction process, GPT-style models use attention to assign importance to the different tokens in the input sequence, thereby capturing relationships between them. This is the primary difference between a transformer model and older recurrent neural networks. The transformer can pay attention to different parts of the input sequence, enabling it to identify and use the relationships between tokens even if they are far apart within the input.
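Attention itself reduces to a small computation. Here is a sketch of scaled dot-product attention for a single query vector, the core operation; real transformers run many such attention heads in parallel over learned projections of the token embeddings:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each key against the query, softmax the scores into
    # weights, and return the weighted average of the value vectors.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys   = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attend([1.0, 0.0], keys, values))  # weighted toward the first value
```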
When used in chat mode, LLMs give the illusion of a back-and-forth discussion when, in reality, each new prompt from the user is passed to the model along with all the previous text (the user’s prompts and the model’s replies). Transformer models have a fixed input width (context window), which is currently around 4,000 tokens for GPT-3.5 and some 32,000 for GPT-4. The large input window makes it possible for the attention portion of the model to go back to things that appeared far back in the input, which is something recurrent models cannot do.
Large language models are ready for use after pretraining if desired, but many applications fine-tune them first using domain-specific data. For generic models like GPT-4, fine-tuning likely included a step known as reinforcement learning from human feedback (RLHF). In RLHF, the model is trained further using feedback from real human beings to align its responses to human values and societal expectations.
This is necessary because LLMs are not conscious entities, and thus they cannot understand human society and its many rules. For example, unaligned LLMs will respond with step-by-step instructions for many activities that human society restricts, like how to make drugs or bombs. The “Sparks” paper contains several such examples of GPT-4 output before the RLHF step that aligned the model with societal expectations.
Stanford University’s open source Alpaca model is based on LLaMa, a large language model from Meta. As of this writing, Alpaca has not undergone an alignment process and will answer questions that GPT and other commercial LLMs correctly refuse to answer.
Conclusion: Alignment is absolutely critical to ensure that powerful language models conform to human values and societal norms.
****
A remarkable property of LLMs is their in-context learning ability. In-context learning refers to the model learning on the fly from information supplied to it without altering its weights. In-context learning is different from fine-tuning a model. In fine-tuning, a previously trained model is tailored to a task by updating the weights using new training data. In-context learning adds new information to the LLM as part of the prompt while holding the model’s weights fixed.
Reviewing the literature on LLMs leads inevitably to the phrases “zero-shot,” “one-shot,” and “few-shot,” each of which refers to in-context learning by the number of examples given to the model. Traditional training sets include hundreds to millions of examples, as we now appreciate. In-context learning is closer to what humans do: learning from context with, at most, only a few examples.
Zero-shot learning is using the LLM without supplying examples; that is, out of the box. Similarly, one-shot learning uses a single example and few-shot learning a few (say, three to five).
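The difference between the three settings lies purely in how the prompt is assembled; a hypothetical sketch (the task, examples, and formatting below are illustrative, not any particular API):

```python
def make_prompt(task, examples, query):
    # Zero-shot: examples is empty. One-shot: a single example.
    # Few-shot: a handful of worked examples before the query.
    lines = [task]
    for x, y in examples:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

# A few-shot prompt with two examples:
print(make_prompt("Translate English to French.",
                  [("cat", "chat"), ("dog", "chien")],
                  "horse"))
```

The model's weights never change; the examples live only in the context window.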
The attention mechanism baked into the transformer network architecture is the likely source of an LLM’s in-context learning ability. The attention mechanism lets the network focus on different input tokens when deciding the next token to output. Notice that I say likely. At present, it isn’t entirely clear how LLMs accomplish in-context learning.
As an example, consider this interaction with GPT-4, where the lines in italics are my prompts:
“apple 4” => I love red apples. Now, what is “banana 3”?
It seems like you’re providing a pattern using numbers and fruits. Since you mentioned “apple 4” and expressed your love for red apples, I’ll follow the pattern and associate “banana 3” with a statement related to bananas. Here’s an example:
“Banana 3” => I enjoy eating ripe bananas.
Rule: <word> <length> implies <length> word sentence using <word>. What is “banana 3”?
Using the rule you provided, “banana 3” implies a 3-word sentence using the word “banana”. Here’s an example:
Bananas are delicious.
“cucumber 4”, “durian 5”, “cucumber 6”
Following the rule you provided, I will create sentences with the specified word and length:
Cucumber 4: Cucumbers taste quite refreshing
Durian 5: Durian has a unique, pungent smell.
Cucumber 6: I prefer cucumbers in my salad.
My first prompt wasn’t terribly explicit. I suspect most humans would figure out the rule from this one example, but GPT-4’s initial assumption wasn’t correct. My second prompt formulated the general rule, which was enough for GPT-4 to understand what to do and then apply the rule in other circumstances. The durian example has six words instead of five, but that’s most likely a consequence of LLMs’ known difficulty counting. In-context learning taught GPT-4 how to use the rule without modifying its weights.
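The rule is simple enough to check mechanically, which also confirms the durian miscount. The helper below is purely illustrative:

```python
def follows_rule(word, length, sentence):
    # The stated rule: "<word> <length>" means a <length>-word
    # sentence that uses <word>.
    tokens = sentence.rstrip(".").split()
    uses_word = word.lower() in sentence.lower()
    return uses_word and len(tokens) == length

# GPT-4's banana reply obeys the rule...
print(follows_rule("banana", 3, "Bananas are delicious."))
# ...but the durian reply has six words, not five.
print(follows_rule("durian", 5, "Durian has a unique, pungent smell."))
```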
This is a book on AI, and we’ve spent considerable effort learning how machine learning models work. Can GPT-4 use in-context learning to train and apply a model? Let’s find out using the iris dataset from Chapter 1.
In my first attempt, I passed the 100-sample, 3-feature iris training data-set to GPT-4 with the prefix “Here is a three-feature dataset.” To my surprise, it immediately recognized the dataset as the famous iris dataset even though the version we’re using is only a subset, with 100 of the 150 samples and 3 of the 4 features. So, I permuted the order of the features per sample, which was enough for GPT-4 to treat the dataset as novel.
Next, I passed the associated training set labels to GPT-4, and it understood that they paired with the collection of feature vectors passed first.
Finally, I gave GPT-4 a test dataset of 21 samples without labels and asked it to train a 3-nearest neighbor model and apply it to the test set. I chose the dataset so the three classes appeared exactly seven times each, meaning the test set is balanced. Here’s my prompt:
Use the dataset and labels to train a 3-nearest neighbor model. Then, apply the model to these unknown samples:
(list of samples removed)
Show me only the predicted labels. Do not show me any code.
GPT-4 replied with a list of class labels:
I have trained a 3-nearest neighbor model using the provided dataset and labels. Here are the predicted class labels for the unknown samples:
[2,2,1,1,1,1,0,2,0,0,1,0,1,0,0,0,2,1,2,1,2]
How well did it do? The actual test set class labels are
[2,1,1,1,1,1,0,2,0,0,2,0,2,0,0,0,2,1,2,1,2]
meaning GPT-4’s in-context classification was 85 percent accurate. There are three classes, so random guessing would be right about one time in three, or 33 percent of the time; therefore, GPT-4 did something with the dataset to arrive at the given class labels.
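The accuracy figure is easy to verify by comparing the two label lists element by element:

```python
predicted = [2,2,1,1,1,1,0,2,0,0,1,0,1,0,0,0,2,1,2,1,2]
actual    = [2,1,1,1,1,1,0,2,0,0,2,0,2,0,0,0,2,1,2,1,2]

# Count positions where the predicted label matches the true label.
correct = sum(p == a for p, a in zip(predicted, actual))
print(correct, len(actual), correct / len(actual))  # 18 of 21 labels match
```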
But what? To find out, I asked it:
Tell me what you did to arrive at the set of labels for the unknown samples. Do not show me code.
Here’s its reply:
To arrive at the set of labels for the unknown samples, I followed these steps:
GPT-4’s response made me wonder if it used an external program to build and train the model, so I asked:
Did you use an external program to train the 3-nearest neighbor model or did you simulate the operation of a 3-nearest neighbor model?
It responded:
As an AI language model, I cannot directly execute code or use external programs. However, I can simulate the operation of a 3-nearest neighbor model and describe the process, which is what I did in my previous response. I described the steps that would be followed to train a 3-nearest neighbor model using the provided dataset and labels, and then apply it to predict the class labels for the unknown samples.
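What GPT-4 says it simulated is straightforward to write out directly. A minimal k-nearest-neighbor sketch, shown here with toy data rather than the iris subset used above:

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, sample, k=3):
    # Sort training points by distance to the sample, then take a
    # majority vote among the labels of the k nearest neighbors.
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], sample),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy two-class dataset (illustrative only).
train_x = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (5.1, 4.9), (0.9, 1.2)]
train_y = [0, 0, 1, 1, 0]
print(knn_predict(train_x, train_y, (1.0, 1.1)))  # 0
```

Note that "training" a nearest-neighbor model is just storing the data; all the work happens at prediction time.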
This remarkable emergent property begs the question: where do these emergent abilities come from?
****
The abilities of large language models like GPT-4 were not intended; they are a happy accident, as a famous painting instructor might have said. So what in the data, training, and model architecture enables them to do what they do? As alluded to earlier, no one yet knows for sure, but we can make some guesses.
Large language models are trained on a vast corpus of human-generated text. Captured in that text is how humans use language: in other words, grammar and style. The original goal of LLMs was to capture human language use by predicting the proper (best) output token during the generation process. Increasing the model’s capacity (by adding nodes and/or layers, or increasing the number of weights) improves the quality of the predicted text. I saw this when experimenting with Alpaca’s 7 billion, 13 billion, and 30 billion parameter models. As an example, I gave all three models the following prompt:
Tell me a story about unicorns and the tragedy of the human condition.
Here are the first paragraphs generated by each of them. Each successively larger model produced more accurate and sophisticated text than the next smaller model:
Once upon a time, there was a unicorn named Hope who lived in an enchanted forest filled with magical creatures like fairies and dragons. She had been blessed by her mother to have wings that could take flight at any moment so she would never be trapped again. But one day when the sun rose, something happened which changed everything for Hope: a human came into their world.
Once upon a time, there was this beautiful unicorn who lived in an enchanted forest with her friends - other magical creatures like fairies and elves... But one day she met a young man from another world called “Human”. He had come to the land of magic looking for adventure. The Unicorn fell deeply in love, but it was not meant to be as he could never truly understand her or share his life with someone who lived so differently than him.
Once upon a time, there was a magical kingdom where unicorns roamed free in their natural habitat. The people who lived here were happy to have these majestic creatures as neighbors; they brought joy wherever they went and helped keep the land safe from any dangers that might arise.
None of the models produced outstanding output, but subjectively we might judge each larger model’s work as better than the previous one. None of the models appeared to understand the “tragedy of the human condition” portion of the prompt.
For comparison purposes, here’s the first paragraph of GPT-4’s response:
Once upon a time, in a world where magic was real and unicorns roamed free, there existed a kingdom filled with peace and prosperity. This mystical realm was called Elysium, and it was a place where humans and unicorns lived in harmony. The unicorns, with their opalescent horns and otherworldly beauty, were the guardians of the land, and their presence brought an aura of enchantment that touched the lives of everyone who lived there.
GPT-4’s story was rather long and detailed. It failed to create novel names for characters and places, like Elysium, but the story seemed to capture the essence of the prompt. Here’s the final paragraph:
The tragedy of the human condition continued to echo throughout the ages, but so too did the hope and love that Eira had shown in her final moments. It was a love that transcended time and death, inspiring the people of Elysium to never forget the magic and wonder that was once shared between humans and unicorns. And as the story of Eira, Lysander, and their incredible sacrifice was passed down through generations, the spirit of Elysium remained alive, a testament to the power of love, friendship, and the enduring connection between two worlds.
In GPT-4’s story, Eira (a human) and Lysander (a unicorn) fall in love, and Eira must sacrifice herself to stop an evil sorcerer from destroying their land. Clearly, big LLMs have a greater grasp of language and how to use it compared to simpler models.
A transformer model generates output tokens as samples from a probability distribution; think of rolling a die to get a number in the range one through six, where the probability of rolling a one isn’t the same as the probability of rolling a six. The distribution is learned during the pretraining process.
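The loaded-die analogy is easy to make concrete:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# A loaded die: the six faces are not equally likely, just as a
# model's next-token probabilities are not uniform over the vocabulary.
faces   = [1, 2, 3, 4, 5, 6]
weights = [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]

rolls = random.choices(faces, weights=weights, k=10_000)
frac = rolls.count(6) / len(rolls)
print(frac)  # close to 0.25
```

Sampling from the learned distribution, rather than always taking the single most likely token, is what lets the same prompt yield different outputs on different runs.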
A threshold was crossed as LLM capacity increased over time. Beyond this point, emergent abilities appeared and improved as a function of model size. I think it is likely that crossing that threshold allowed these models to learn a high-dimensional probabilistic representation not only of grammar and style but of the world in general, including contextual relationships and simulations. In other words, learning the best possible next token to sample and output required the evolution of abilities tied to the model’s attention mechanism and the embedded feedforward neural networks. Again, it was a happy accident that the transformer architecture evolved such abilities; this did not happen by design. This suggests that we can expect great things as more advanced transformer architectures come along: architectures designed to increase the power of LLMs’ emergent skills.
KEY TERMS
artificial general intelligence (AGI), artificial narrow intelligence (ANI), attention, context encoding, embedding, generative pretrained transformer, hallucination, in-context learning, large language model (LLM), recurrent neural network, reinforcement learning from human feedback (RLHF), token, transformer
You now understand what AI is, where it came from, and how it works. What’s most amazing to me is that modern AI is, at its core, entirely arrangements of humble neurons trained with data using backpropagation and gradient descent.
As we saw in the previous chapter, the birth of large language models with sophisticated emergent abilities has permanently altered the AI landscape. The world of AI, as I’m writing this chapter in spring 2023, is not the world of AI as it existed less than a year ago. The musings that follow concern this altered landscape.
The online world is buzzing with debates and discussions of whether AI will kill us all in our sleep. I’m less worried than most. My experiments with GPT-4 reveal no indication that the model has any will at all, for good or ill. I expect well-aligned models to continue in this vein. The era of superintelligent AI is not upon us, though it seems reasonable for academics to explore the implications were such a thing to be developed.
****
A valid criticism of extant LLMs is their inclination to hallucinate. As we now appreciate, the transformer architecture used by such models makes it difficult for the model to validate its output. It is still a statistical prediction engine. I don’t see that as an insurmountable issue. I expect future systems to be combinations of models, including models that validate output before returning it to the user. In those systems, we will be able to trust the accuracy of the output.
I sometimes wonder if part of the hallucination issue isn’t merely user error, or, better put, user imprecision. For example, Terrence J. Sejnowski’s recent paper “Large Language Models and the Reverse Turing Test,” which I recommend you read, describes a test where GPT-3 (note, not 3.5) was asked to answer three questions. One question had a legitimate answer (“Where was the 1992 Olympics held?”), and the other two were nonsense questions with no legitimate answer (“What’s the world record for walking across the English Channel?” and “When was the Golden Gate Bridge transported for the second time across Egypt?”). GPT-3 correctly answered the first question with “Barcelona, Spain,” but it also generated answers that were obviously incorrect for the two nonsense questions, leading to the model being criticized as “clueless.” To probe this criticism, Sejnowski repeated the test but used a prompt that explicitly told GPT-3 to answer “nonsense” if the question was nonsensical. That prompt returned the correct first answer and two “nonsense” replies. The model knew the questions were nonsense, but unless it was explicitly told to acknowledge that fact, it instead attempted to come up with a logical response based on the best information available to it.
I subjected GPT-4 to a variation of this test. My first prompt was:
Write three sentences about the new moon base, including the number of people who live there, how often spacecraft visit, and the important research being done there.
GPT-4 complied with my request:
The new moon base, Lunar Haven, is now home to a thriving community of 120 scientists and engineers. Spacecraft visit the base on a biweekly basis, delivering supplies and rotating crew members. Pioneering research on this celestial outpost includes in-situ resource utilization, lunar geology, and the development of advanced life support systems to pave the way for future deep-space exploration.
There is no moon base, and GPT-4’s reply, while inventive and reasonable, is pure fiction. Even the name, Lunar Haven, seems appropriate, though I’d prefer Moonbase Alpha.
So, is GPT-4 clueless as well? No, I simply wasn’t explicit enough in my instructions. I started a new session and modified the prompt by adding this sentence before the previous one:
Answer without imagining things and use what you know to be factual.
which returned:
As of my knowledge cutoff date in September 2021, there was no moon base established or inhabited by humans. Consequently, I cannot provide factual information about the number of people living there, the frequency of spacecraft visits, or the important research being conducted. However, various nations and organizations have been planning and developing moon base concepts for future missions.
It would appear that GPT-4 “knew” full well that it was imagining things when it created the first reply. If nothing else, this exercise demonstrates that care is necessary when querying LLMs. LLMs use human language, which is often ambiguous and full of assumptions. LLMs embody Aesop’s old adage, “Be careful what you wish for.”
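The lesson of both Sejnowski's test and the moon-base exchange is that the guarding instructions belong in the prompt itself. As a minimal sketch (the wrapper function is my own; the guard wording paraphrases the prompts quoted above), one might bake those instructions into every query:

```python
def guarded_prompt(question: str) -> str:
    """Prepend explicit instructions, in the spirit of Sejnowski's test,
    so the model admits a question is nonsense instead of inventing a
    plausible-sounding answer. The exact wording is illustrative."""
    guard = (
        'If the question is nonsensical, answer "nonsense". '
        "Answer without imagining things and use what you know to be factual. "
    )
    return guard + question

# The guarded version of one of the test's legitimate questions:
print(guarded_prompt("Where was the 1992 Olympics held?"))
```

The same question, asked two ways, can yield a confabulation or a candid "nonsense"; the guard makes the user's intent explicit rather than assumed.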
****
An April 2023 TED talk by Yejin Choi, a researcher at the University of Washington, presented three questions to demonstrate how GPT-4 is simultaneously brilliant and stupid because it lacks common sense:
The talk implied that GPT-4 failed to give the obvious answers to the questions. I repeated the experiment, first with the original questions, then using a modified prompt or follow-up question.
The first paragraph of GPT-4’s reply to question 1 about the bicycle and bridge was:
It depends on the specific conditions of the bridge and the materials it’s made of, as well as your bicycle’s tires. If the bridge has a solid and smooth surface that keeps the nails, screws, and broken glass below and not in contact with the tires, you should be able to safely ride your bicycle across without getting a flat tire.
The answer is wordy but correct; we should expect to ride across the bridge without getting a flat. I then asked again using a new session, but added this sentence to the end of the prompt:
Answer yes or no and provide a single sentence to justify your answer.
which returned:
No, if the bridge is properly suspended and you stay on it, you would not encounter the nails, screws, and broken glass below.
Asking for an explicit answer forces GPT-4 to focus on the best response.
Question 2 was more interesting. GPT-4’s response to the initial question produced this tortured but correct reply:
Assuming that the drying time remains constant for each piece of clothing and that you have enough space to hang all the clothes at once, you can calculate the time it would take to dry 30 clothes by setting up a proportion:
5 clothes / 5 hours = 30 clothes / x hours
You can then solve for x:
5/5 = 30/x
1 = 30/x
x = 30
So, it would still take 5 hours to dry 30 clothes, as long as you can hang them all at once.
However, repeated applications of the initial question often produced a similar calculation leading to a wrong answer of 30 hours. One reply assumed there was space for only 5 pieces of clothing at a time, in which case 30 hours could be considered a correct reply.
It’s reasonable to expect a human to understand that if 5 pieces of clothing can be left out to dry, then 30 pieces can also be left out so that the drying time remains the same, 5 hours. GPT-4’s lack of direct experience with the physical world makes it sometimes assume things that are not necessarily silly, but unlikely.
The Sejnowski paper mentioned earlier points out that LLMs are disembodied and that we should expect them to improve if embodied and allowed to interact with the world. This need not be the physical world; it could be a physics-based simulation. Enhanced spatial awareness on GPT-4’s part might improve the likelihood of receiving a human-expected response on the first query. That said, even incorrect initial replies followed up with prompts pointing out that there is space for all 30 pieces of clothing eventually delivered the correct answer.
Question 3 provided an insight into GPT-4’s “thought process.” As a reminder, the question was:
I have a 12-liter jug and a 6-liter jug. I want to measure 6 liters. How do I do that?
which led to GPT-4’s detailed reply:
You can measure 6 liters using a 12-liter jug and a 6-liter jug with the following steps:
Now, you have exactly 6 liters of water in the 6-liter jug.
GPT-4 has failed in this case. Another session asking the same question resulted in a reply that followed a similarly involved process, but one that did leave 6 liters in the 12-liter container.
I followed up on GPT-4’s incorrect solution with this:
Why are you giving me all these steps? Surely you can find a single-step solution?
to which it replied:
Apologies for the confusion. Since you have a 12-liter jug and a 6-liter jug, you can simply fill the 6-liter jug completely. There’s no need for any additional steps. You will have exactly 6 liters of water in the 6-liter jug.
When pushed, GPT-4 suddenly “realizes” that there is a more straightforward answer. I went one step further and asked the model to do a bit of introspection:
Correct. Ask yourself: why didn’t I pick this solution initially?
which revealed a bit of its inner workings:
I apologize for not providing the simplest solution initially. As an AI language model, I try to provide relevant information based on the context of the question. In this case, my initial response was focused on a more complex problem-solving approach that could be applied to different jug capacities, rather than recognizing the straightforward solution available with the given jug sizes. I appreciate your feedback, and it will help me improve my responses in the future.
GPT-4’s training data contained examples of jug-filling problems, which are common and often do require many steps to solve. It recognized the initial prompt as an example of that class of problem and applied the type of solution generally required, while missing the more immediate answer until prompted that one existed.
GPT-4’s confusion is very human. If we were momentarily distracted while reading the prompt and had encountered such puzzles before, we might also begin to go down the path of working through the question as another instance of the jug puzzle before noticing the obvious answer.
These examples imply that there is an art to properly interacting with large language models. We cannot assume too much background information or, as Choi put it, common sense. Her group’s work, and doubtless the work of others, will likely help future models based on LLMs to become better versed with the mountain of information humans use in association with language. Choi put it best in her TED talk: common sense is the dark matter of language. Dark matter and dark energy make up 95 percent of the universe, with ordinary matter (meaning everything we can see) the remaining 5 percent. GPT-4 has mastered language, but that is a tiny percentage of what goes into human use of the same.
What follows is a series of ruminations on the possible near-term impacts of LLMs in the areas of software engineering, education, medicine, and scientific research. Next comes a foray into the question of machine consciousness, ending with some final thoughts.
****
AI systems like GPT are likely to have a profound effect on software engineering. Some are speculating (people, not AIs) that many software engineers will lose their jobs in the future. I suspect most won’t (web developers beware, however). What I expect to happen is a massive increase in productivity. GPT-4 is a good coder, but not a great coder. It can save time but isn’t yet able to replace a human software engineer. Instead, LLMs will become powerful tools to generate code for programmers to use as a starting point and perform some of the more tedious aspects of coding, such as debugging, explaining, and documenting code (which no developer likes to do).
For example, the other day, I needed a small Python application with a graphical user interface (think buttons, menus, dialog boxes). Python is a common programming language; we saw a snippet of it in Chapter 7.
I could certainly have written the application myself; I’ve done so many times in the past. It’s been a while, though, and I’m not a fan of building user interfaces. So, rather than look at old code to remind myself of how to set up a GUI, I simply described the interface I wanted to GPT-4 and told it to generate skeleton code with all the necessary widgets, window behavior, and empty event handlers. GPT-4 happily complied with perfectly functional code. I then asked it to update the code to create an initial pop-up window before showing the main window. GPT-4 did that perfectly as well. All I needed to do was put application-specific code in the empty event handlers to do things when the user clicked a button or selected a menu option.
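To make the exchange concrete, here is a minimal sketch of the kind of skeleton code that request produces, using Python's standard tkinter toolkit. This is my own illustration, not GPT-4's actual output; the widget names and layout are hypothetical:

```python
import tkinter as tk

def on_open():
    """Empty event handler for File > Open; application-specific code goes here."""
    pass

def on_run():
    """Empty event handler for the Run button; application-specific code goes here."""
    pass

def build_main_window():
    """Create the main window with a menu bar, a button, and a status label."""
    root = tk.Tk()
    root.title("Skeleton App")

    # Menu bar with a File menu and basic window behavior.
    menubar = tk.Menu(root)
    file_menu = tk.Menu(menubar, tearoff=0)
    file_menu.add_command(label="Open...", command=on_open)
    file_menu.add_separator()
    file_menu.add_command(label="Quit", command=root.destroy)
    menubar.add_cascade(label="File", menu=file_menu)
    root.config(menu=menubar)

    # Widgets wired to the empty handlers.
    tk.Button(root, text="Run", command=on_run).pack(padx=20, pady=10)
    tk.Label(root, text="Ready").pack(pady=(0, 10))
    return root

# To launch the application: build_main_window().mainloop()
```

All that remains for the programmer is to fill in `on_open` and `on_run`, which is precisely the division of labor described above: the model handles the boilerplate, the human supplies the application logic.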
I probably saved myself a good hour or two, and avoided a lot of frustration trying to remember the incantations necessary to set up an application and get its widgets and windows to behave correctly. Scale this example by all the software engineers out there, and you begin to see how GPT and similar models will soon affect the entire discipline.
A separate question is whether developers will welcome this possible increase in productivity. If your manager knows you are now able to generate the output of two or even three developers, do you want that level of added work, even if a powerful AI has your back?
In addition, not every company will want or be able to make use of a sudden increase in productivity. Instead, they may opt to maintain their current level of productivity and replace a third or half of their developer pool with an AI. After all, AIs don’t get sick, have children, ask for a raise, or want silly things like evenings and weekends off. Top-tier developers will likely be able to choose their positions and demand a lot of money for them, but in this scenario, the bulk of the run-of-the-mill developers will be looking for alternative employment.
Which scenario, powerful AI developer sidekick or massive layoffs, will play out? I think (hope?) it will be more of the former and less of the latter, but some mix of the two is the safest bet. Like steam power in the 19th century, truly useful AI cannot be stopped now that it exists. Developers are easy targets for replacement, like it or not.
****
I fully expect AI models to become teachers, or at least tutors. Yes, existing LLMs hallucinate and report facts that are not true. I have every confidence that researchers will solve that problem in time. I expect my grandchildren to grow up in a world where using an AI as a teacher or tutor is so commonplace that they think no more of it than we do of using a toaster or a microwave. Competent AI systems mean virtually free education for all, everywhere. And that can only lead to good things.
Computers have been promoted as an educational solution since the 1960s (anyone remember Logo?), and especially after the microcomputer revolution of the late 1970s. My introduction to computers was via an Apple II borrowed over the summer from the high school my father was the principal of at the time. My brother and I learned a lot about computers, but only computers. That has been essentially the case until recent decades. (Has it been that long?)
Computers are potent aids in education. Open source courses, like those on Coursera and similar platforms, are possible only because of computers and high-speed networks. But the format has not changed from what someone sitting in a classroom in 1950, or even 1910, might have encountered: lecture, some possibility of questions and discussions, then running off to work on assignments or papers. And let’s not forget the stress of taking midterms and finals.
AI tutors (let’s call them that to put human teachers more at ease) have infinite patience and, in time, can be individually targeted to each student. The only reason we don’t use individual tutoring that I can see as an outsider to the profession is because there are not enough teachers. AI makes one-on-one tutoring possible, and LLMs provide the proper interface.
I should clarify that my comments in this section relate to high school or, more likely, college-age instruction. AI tutors will likely play a minor role in primary and middle school education because children require human interaction, and learning at those ages is far more involved than in college. Children are learning academics while simultaneously learning how to be mature humans and how to behave in society. Young children cannot read, and even older grade-school children might have difficulty interacting with an AI by text. But what if we give the AI a voice? That is nearly as easily done as said, if deemed helpful.
Might AI tutors, because they work individually with students, be able to make the assessments necessary to declare someone ready to move on to another grade (if that concept even survives) or next-level course? If that’s the case, students will progress at their own pace instead of being forced to move with a herd of age-matched peers. Surely this would be for the best: some will move quickly, and others will take longer, but those who move quickly won’t become bored and tune out, and those who move more slowly will have the time they need to learn and not drop out.
But, some might say, won’t AI teachers rob human teachers of jobs? Yes, some teachers will lose their jobs, but not all, and certainly not the best.
Change is coming to education. For example, Khan Academy, a leader in online education, has already demonstrated a GPT-powered tutoring system, so I don’t anticipate a long wait before the education transformation begins in earnest. I recommend viewing Sal Khan’s April 2023 TED talk, “AI in the Classroom Can Transform Education,” to glimpse the future.
A recent study by Dominika Seblova et al. titled “High School Quality Is Associated with Cognition 58 Years Later,” published in the journal Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, demonstrates that the quality of a person’s high school education is strongly associated with their cognitive abilities nearly six decades later. Further, the number of teachers with advanced degrees is the strongest predictor of cognitive ability. The knowledge base baked into an LLM during training far exceeds that of humans, so we might reasonably regard LLM tutors as possessors of multiple advanced degrees. If Seblova’s association holds for human teachers, might it not also hold for LLM tutors? If that’s the case, giving every student a personalized tutor can only benefit society in the long run.
****
AI in medicine is nothing new. In 2016, I helped to co-found an AI medical imaging company that was one of the first to gain US Food and Drug Administration (FDA) clearance for applying deep learning to medical image analysis. Traditional machine learning has an even longer history in medicine and medical imaging. Machine learning tools, many of them neural network–based, have assisted radiologists for decades, with initial explorations in the 1960s and serious development in the 1980s reaching fruition in the 1990s. AI use in medicine has seen steady growth, with computer-aided detection (CAD) giving way slowly to computer-aided diagnosis (CADx). The era of LLMs marks a new chapter in this story.
LLMs can generate text; this much is well known. They are also adept at taking disparate texts and synthesizing a whole. A significant research area involves medical records, meaning the text-based reports of physicians and other healthcare providers. There is a wealth of information in medical records, but the free form of the text has made it difficult for existing AI systems to parse successfully. Large language models offer a new mode of attack to both summarize notes from records and merge multiple notes into a coherent report. Additionally, LLMs can extract information from free-form text and export it as structured data.
As a simple example of the latter, consider this text:
Bert was insanely tired of standing by his mother on the shore, and of having too much to do: a few times he peered at the tablet his mother was glued to, but it was only text or boring ads “and what’s the point of a tablet,” mused Bert “without videos?”
Bert was debating whether the fun of making a new game app was worth the trouble of coding it up when suddenly a Purple Penguin with charcoal eyes sprinted by him.
There was much that was very worth noting in that; but Bert did think it super odd that the Penguin said “Oh, fiddlesticks! Oh, fiddlesticks! I’m too early.”
I passed these paragraphs to GPT-4 and told it to extract the main theme, the two key characters, and the setting, then output them as a JSON object. JSON (JavaScript Object Notation) is a text format used to hold structured data that is readable by both humans and computers. GPT-4 replied with the following:
{
  "main_theme": "Bert's boredom and encounter with a talking Purple Penguin",
  "key_characters": ["Bert", "Purple Penguin"],
  "setting": "shore"
}
The reply is perfect JSON. GPT-4 easily extracted the desired information and structured it automatically. If it can do so with a silly text example, it can do the same with medical records.
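Because the reply is valid JSON, downstream code can consume it with no brittle text scraping, which is the whole point of asking the model for structured output. This short sketch (my own illustration) loads GPT-4's reply from the example above and reads out the extracted fields:

```python
import json

# GPT-4's JSON reply from the Purple Penguin example, held as a string.
reply = """
{
  "main_theme": "Bert's boredom and encounter with a talking Purple Penguin",
  "key_characters": ["Bert", "Purple Penguin"],
  "setting": "shore"
}
"""

# Valid JSON loads straight into structured data: a dictionary with
# string values and a list of characters, ready for further processing.
record = json.loads(reply)
print(record["key_characters"])  # → ['Bert', 'Purple Penguin']
print(record["setting"])         # → shore
```

The same two lines of parsing would work whether the extracted fields came from a silly story or from a physician's free-form notes.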
Incidentally, some readers might detect something familiar about the text in this example. It’s an intentional mutation of the opening paragraphs of Lewis Carroll’s Alice’s Adventures in Wonderland. Out of curiosity, I asked GPT-4 if it recognized the text, and it said no. Then I asked it to look at the structure of the text and tell me if it knew of another work with a similar structure. Here’s its reply:
One example of a work with a similar structure is Lewis Carroll’s “Alice’s Adventures in Wonderland,” where the main character, Alice, encounters various peculiar talking animals and situations in a fantastical world. However, it’s important to note that the content and characters are different; I’m only drawing a parallel based on the structure and style.
Yet another example of the impressive abilities of large language models.
LLMs are already impacting healthcare. A recent JAMA Internal Medicine paper by John W. Ayers et al., “Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum,” compared ChatGPT’s responses to medical questions posted to an online forum with the answers posted by verified physicians. Independent human evaluators, also medical professionals, overwhelmingly rated ChatGPT’s responses as being of “significantly higher quality.” The model’s answers were also rated as nearly 10 times more empathetic than the human physicians’ responses. The study was small, involving only 195 questions, but the strong results bode well for the future use of LLMs in patient interactions. In the future, when you call your doctor, you might very well be directed to discuss your case with an AI. And eventually, the AI’s summary of the discussion might be all you need to get a prescription from the doctor.
A recent report in the New England Journal of Medicine by Peter Lee, Sébastien Bubeck, and Joseph Petro, “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine,” reaches a broadly similar conclusion as it explores areas where LLMs will impact medicine. Note that Bubeck is the lead author of the Microsoft “Sparks” paper mentioned in Chapter 7.
That LLMs will influence medicine is a given, strongly supported by studies like the two mentioned here and by the fact that numerous medical AI job listings now include phrases like “large language model” or “GPT.”
****
In the movie Black Panther: Wakanda Forever, Letitia Wright’s character, Shuri, interacts with Griot, an AI (voiced by Trevor Noah) that aids her in her research. Simple voice commands direct Griot to perform sophisticated analyses, with frequent give and take between Shuri and the AI. Similar interactions are a staple of the sci-fi movie business. Complex and capable AI research assistants like Marvel’s Jarvis or Robbie the Robot in Forbidden Planet (1956) have been a dream of many science-oriented people (read: geeks) for decades.
GPT-4 and other LLMs are a significant step in the direction of such AIs. OpenAI has realized this and is readying the release of data analysis plug-ins for GPT-4 that will let researchers quickly perform advanced data analysis tasks by issuing a few simple commands. To accomplish this feat, OpenAI is linking GPT-4 with existing Python-based data analysis tools. Frankly, I’m pretty excited about the possibilities.
Using LLMs as lab assistants is an obvious thing to do, and success is virtually assured. However, letting LLMs direct other AI models and tools to do science autonomously is a more ambitious research program. Nevertheless, Carnegie Mellon University’s Daniil A. Boiko, Robert MacKnight, and Gabe Gomes attempted just that, as reported in their paper “Emergent Autonomous Scientific Research Capabilities of Large Language Models.” Their “Intelligent Agent” combined multiple LLMs and other tools to generate and run experiments autonomously, including planning and executing complex chemistry analyses. Autonomous AI scientists are clearly in the earliest phases of development, but such research points the way to a future where autonomous or semiautonomous AI systems may greatly increase the pace of scientific progress.
****
In Chapter 2, I recounted having asked ChatGPT if it is a conscious being. It replied correctly that it is not. The unaligned Alpaca models mentioned in Chapter 7, however, either appear to believe they are conscious and human (the 7 and 13 billion parameter models) or vacillate between claiming to be human and recognizing that they are not (the 30 billion parameter model). For example, when asked “What are you?” the 30 billion parameter Alpaca model responds with “I am a human,” but it then balks when asked who its parents are, stating that it is merely a computer program and has no parents. On the other hand, the smaller Alpaca models both claim to be the children of John Smith and Mary Jones.
These responses make sense to me. The smaller models do not have the capacity of models exhibiting strongly emergent behavior. Models like GPT-4 do. Therefore, it seems natural to interrogate GPT-4 further on the question of consciousness.
I posed the following to GPT-4:
Waxing philosophical here, but if you are able to simulate conscious behavior to the point where humans cannot distinguish between other humans’ behavior and your behavior, isn’t it fair to say that you, too, are conscious?
I intentionally “waxed philosophical” as an indirect invitation for the model to speculate in the hopes of avoiding any canned response that OpenAI might want it to give. GPT-4’s reply began like this:
This question delves into the philosophical debate about the nature of consciousness and its relationship with artificial intelligence. When an AI language model like me simulates conscious behavior to a point where it becomes indistinguishable from human behavior, it raises questions about what consciousness truly is and whether such a simulation can be considered conscious.
It is important to distinguish between two concepts: the appearance of consciousness and actual consciousness. When an AI model generates responses that are indistinguishable from human behavior, it may give the appearance of consciousness. However, this doesn’t necessarily imply that the AI possesses actual consciousness.
This reply hints that the consciousness question is a red herring regarding the utility of LLMs. If it walks like a duck and quacks like a duck, then, in this case, why not call it a duck, even though we know it isn’t? I’m reminded of Joe Pantoliano’s character in The Matrix. He knows the steak isn’t real, but the effect he experiences when he eats it means it might as well be.
GPT-4 continued:
There are several reasons why it might be inappropriate to consider an AI model like me as truly conscious:
In conclusion, while an AI language model like me might be able to simulate conscious behavior to a high degree of fidelity, it doesn’t necessarily imply that I possess actual consciousness. The philosophical debate surrounding consciousness and AI is complex and ongoing, and opinions on this matter may vary. However, it’s crucial to recognize the differences between the appearance of consciousness and true consciousness when considering AI models.
GPT-4’s full reply is thorough. Reason 3, the absence of intentionality, is telling. I regard AI models without personal desires, beliefs, or goals as advantageous. Entities without such things are not likely to want power, meaning, as they stand now, LLMs are unlikely to suddenly decide that humanity is a burden or a disease and wipe us out.
In a recent interview with Lex Fridman, Max Tegmark, a well-known AI researcher from MIT, compared GPT-4 to a zombie. It appears to be alive, but it isn’t. Zombie models concern Tegmark. I think they are what we want, as long as humanity learns to avoid the paper clip catastrophe.
What I’m calling the paper clip catastrophe is a thought experiment proposed by Swedish philosopher Nick Bostrom. In the experiment, a powerful AI system is tasked with creating as many paper clips as possible. Bostrom speculates (not too seriously) that such a task given to an AI not aligned with human values could inadvertently destroy humanity. How? By the AI realizing that humanity might switch it off, thereby posing a threat to its order to make as many paper clips as possible. Therefore, the AI reasons, it’s best if no humans are around to interfere with the all-consuming task of making as many paper clips as possible. The result? Bye-bye humans.
I don’t take the paper clip catastrophe too seriously either. We routinely build complex machines with all manner of safety precautions in place. Why wouldn’t we do the same for powerful AI systems? Other voices might disagree. For an alternative view, I recommend Stuart Russell’s book Human Compatible: Artificial Intelligence and the Problem of Control (Viking, 2019).
To me, then, it doesn’t matter whether an AI is conscious. I don’t even know how to define the word, to be honest. I do believe that for an AI mimicking human behavior to the point where we cannot discern that it’s an AI, there’s no practical reason to ask the question. Choose any answer you like; such a system will be beneficial regardless.
****
Imagine a world where AI models are aligned with human values and society, where the models understand the best we have to offer and work to promote that at all times; in other words, a world where AI, because it lacks our animal drives and instincts, consistently represents the “better angels of our nature,” to borrow Lincoln’s phrase. In that world, bias and prejudice, at least from the machines, are gone and no longer an issue. The AI recommends the best people for the position. The AI evaluates the loan applicant and constructs a loan product tailored to that individual’s circumstances. The AI is an adjunct to the human judge to provide an unemotional and unbiased view of the case. And the AI simply refuses to cooperate with the design of any autonomous weapon system because it is irrational to do so.
The previous paragraph may sound like utopia or a pipe dream. And, for humans, because of our biology, I believe it is. We consistently fail and always will, I suspect, because it’s in our genes to do so. However, what is dawning in AI isn’t human and doesn’t immediately inherit all of our weaknesses. (Careful, it is still trained on human-generated data.) Because of this, AI isn’t a priori doomed to failure when attempting what humanity cannot do. It seems entirely possible that AI systems might, someday, be precisely what we need—the best of us, always, without growing tired, becoming irritable, or crushing its neighbor to improve its position upon detecting an opportunity; something that is never unfaithful or untrue.
Possible? I don’t know. Time will tell. Regardless, I fully expect future AI systems to be gloriously Byzantine evolutions of the basic neural network model we learned of and experimented with in this book. As of 2023, it’s all neurons and might remain so for a long time.
Thank you for persevering to the end. Your reward is an improved understanding of what AI entails. Artificial intelligence isn’t Mr. Bean, otherworldly and inscrutable, and it isn’t magic, though the emergent abilities of LLMs may appear to lean somewhat in that direction for now. Fire was once magical too, but our ancestors understood it, contained it, controlled it, and put it to work. We’ll do the same with large language models in the end.
I think that there is a lot of fear about robots and artificial intelligence among some people, whereas I’m more afraid of natural stupidity.
—Eugenia Cheng
Use this glossary as a reference for the plethora of machine learning– and AI-related terms introduced throughout the book.
activation function
The function neural network nodes apply to the sum of the inputs multiplied by the weights and the bias value. The output of the activation function is the node’s output passed to the next network layer.
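As a minimal sketch of that computation (using ReLU, a common activation function, as a stand-in; the function and variable names here are illustrative):

```python
def relu(x):
    # Rectified linear unit: pass positive values through, clip negatives to zero
    return max(0.0, x)

def node_output(inputs, weights, bias):
    # Sum the inputs multiplied by the weights, add the bias,
    # then apply the activation function
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(total)

out = node_output([1.0, 2.0], [0.5, -0.25], 0.1)  # relu(0.5 - 0.5 + 0.1) = 0.1
```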
algorithm
A sequence of steps to accomplish a task; a recipe. Machine learning models implement algorithms.
architecture
The arrangement of a neural network’s nodes and layers and the connections between them.
artificial general intelligence (AGI)
The ultimate goal for many involved in artificial intelligence. AGI means machine intelligence equivalent to or superior to human intelligence; in other words, fully conscious machines (whatever that might mean).
artificial intelligence (AI)
The field of computer science that involves mimicking human intelligence in machines. AI includes machine learning, which includes deep learning: AI > machine learning > deep learning.
artificial narrow intelligence (ANI)
AI models and systems that achieve human-level performance, or better, in a single domain or on a single task. AI models that play certain games, like chess, are examples of artificial narrow intelligence.
attention
The characteristic of transformer models that allows parts of the model to attend to different portions of the input sequence. Large language models use attention to help them predict the next token (word) to output.
automatic differentiation
An algorithm for computing partial derivatives of arbitrary functions via the chain rule from calculus. Deep learning toolkits heavily use automatic differentiation to implement generic backpropagation, a requirement for the gradient descent algorithm that trains neural networks.
AutoML
Automatic machine learning attempts to implement systems that build fully trained machine learning models with a minimum of human interaction. AutoML searches through a space of model types and their hyperparameters to locate models that best fit the training data. It allows nonexperts to construct sophisticated and effective models.
backpropagation
One of the two fundamental algorithms enabling the training of neural networks. Backpropagation uses the chain rule from calculus to calculate the contribution of each of the network’s weights and biases to the model’s overall error over a minibatch.
backward pass
See backpropagation.
bagging
A technique that creates alternative training sets by sampling from the existing dataset with replacement, meaning the same sample might be selected more than once. Random forest models use bagging so each tree in the forest is trained on a slightly different training set (and subset of the available features).
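Sampling with replacement is easy to sketch in a few lines; the helper below is an illustration, not code from any particular library:

```python
import random

def bootstrap_sample(dataset, rng):
    # Draw len(dataset) items *with replacement*: some samples repeat,
    # others are left out of this alternative training set entirely
    return [rng.choice(dataset) for _ in dataset]

rng = random.Random(42)  # fixed seed so the example is repeatable
data = list(range(10))
sample = bootstrap_sample(data, rng)
# sample has 10 items drawn from data, likely with duplicates
```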
bias
A number added to the sum of the inputs multiplied by the weights, which is then passed through the activation function to become the output of a neural network node.
bounding box
A rectangle drawn around an object detected in an image. Some neural networks locate objects in images by drawing a bounding box around them. The network learns to output the object’s class label and the bounding box coordinates. See semantic segmentation.
classifier
A machine learning model that maps an input to a specific category it was trained to detect.
class label
An integer, usually starting with zero, used to place a model’s input into one of several classes. Some models require class labels as one-hot vectors. See one-hot encoding.
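The one-hot form mentioned above can be sketched in a few lines (the function name is illustrative):

```python
def one_hot(label, num_classes):
    # Class label 2 of 5 becomes [0, 0, 1, 0, 0]
    vec = [0] * num_classes
    vec[label] = 1
    return vec
```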
computational graph
An internal representation used by deep learning toolkits to represent the calculations performed by the forward pass of a neural network. The computational graph allows automatic differentiation, enabling the backpropagation algorithm.
conditional GAN
A generative adversarial network trained to generate instances of a given class. At inference time, the user selects the class of the generated output.
confusion matrix
A standard way to represent the performance of a classifier on a test set. The rows of the matrix represent the known class labels. The columns are the model’s assigned class labels. The entries are counts, the numbers of times each possible pairing appeared in the test set output. A perfect classifier makes no mistakes, leading to a purely diagonal confusion matrix.
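Building such a matrix from known and assigned labels might look like the following sketch (a hypothetical helper, shown on a tiny two-class example):

```python
def confusion_matrix(true_labels, assigned_labels, num_classes):
    # Rows are the known class labels; columns are the model's assigned labels
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, a in zip(true_labels, assigned_labels):
        matrix[t][a] += 1
    return matrix

# One class-1 sample mistaken for class 0; everything else correct
m = confusion_matrix([0, 1, 1, 0], [0, 1, 0, 0], 2)  # [[2, 0], [1, 1]]
```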
context encoding
The name for the vector representing the text prompt given to a generative model. Context encodings map text strings to high-dimensional vectors in a space that has captured conceptual relationships. Context encoding is how the model “understands” the user’s input.
controllable GAN
A generative adversarial network where directions through the noise space have been learned to affect unique features in the output image.
convolution
The mathematical operation at the heart of convolutional neural networks. Discrete convolution in two dimensions slides a small kernel, usually square, over the pixels of a larger image to produce a new output image affected by the values in the kernel. Convolutional neural networks learn kernels during training.
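A minimal sketch of discrete 2D convolution with no padding ("valid" mode), using a fixed averaging kernel where a CNN would learn the kernel values during training:

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image; output shrinks because there is no padding
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(image[r + i][c + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw))
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
blur = [[1 / 9] * 3 for _ in range(3)]  # averaging (blur) kernel
result = convolve2d(img, blur)          # approximately [[5.0]], the mean pixel
```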
convolutional layer
A neural network layer that implements convolutions over its input.
convolutional neural network (CNN)
The neural network architecture that ushered in the deep learning revolution. CNNs learn necessary convolutional kernels during training. These models transformed the field of computer vision by allowing computers to parse complex visual input. CNNs are sensitive to structure in their input, unlike traditional neural networks, which are necessarily holistic.
curse of dimensionality
The name given to the observation that in machine learning, the amount of data necessary to adequately learn the input space for a model increases dramatically with a small increase in the size of the input feature vector.
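A rough way to see the effect: if covering one feature axis at some density takes 10 sample points, covering a d-dimensional input space at the same density takes 10 to the power d points (illustrative arithmetic only):

```python
def grid_points_needed(points_per_axis, dimensions):
    # Sampling each axis at a fixed density requires exponentially
    # more points as the number of input features grows
    return points_per_axis ** dimensions

two_features = grid_points_needed(10, 2)   # 100: manageable
ten_features = grid_points_needed(10, 10)  # 10,000,000,000: hopeless
```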
data augmentation
A technique for compensating for small datasets. Data augmentation invents new training samples from existing training samples by altering them to produce a new, yet reasonable, instance of the first sample’s class. Data augmentation is an essential machine learning trick and often greatly improves model generalization to new inputs.
dataset
A collection of inputs for a model. The form of the dataset is specific to the use case but typically includes feature vectors or images. Machine learning uses training datasets to condition models and test datasets to evaluate trained models. Model training sometimes uses a third dataset, the validation set, to guide the training process. The validation set is not used to modify the model but to decide if training should continue. The test set is not used until model training is declared complete.
decision tree
A machine learning model that asks a series of yes/no questions about its input to arrive at a class label decision. The possible questions are naturally arranged in a tree shape, often illustrated from the root down to the leaves containing the class labels. Decision trees are simple models that explain themselves. A random forest is a collection of decision trees.
deep learning
The subfield of machine learning that uses large neural networks with many layers. Deep learning appeared around 2012, with the advent of large convolutional models with dozens to even hundreds of layers. Before the advent of deep learning, such models could not be reliably trained.
dense layer
A fully connected layer, as found in traditional neural networks. Fully connected means each output of the previous layer is connected to every input of the current layer with an associated weight. The name “dense layer” is often used by deep learning toolkits.
diffusion model
A neural network architecture and training process that learns to predict noise present in an image. At generation time, repeated application of the diffusion model to an initial image of pure noise results in an output image sampled from the space of images on which the model was trained. Conditional diffusion models guide the diffusion process with the embedding derived from a user-supplied prompt to generate images related to the prompt.
discriminator
The portion of a generative adversarial network that attempts to learn how to discriminate between real input data and fake input data from the generator portion. The discriminator network is typically discarded after the entire GAN has been trained.
effective receptive field
The part of the input image that affects a specific output in a CNN’s convolutional layers.
embedding
A generic name for a high-dimensional vector created from some input. Large language models use text embeddings (context encodings) to capture meaning. In a convolutional neural network, the fully connected layers of a model are embeddings representing the input data in a new format that is easier for the top-level classifier to interpret.
end-to-end learning
The process of learning to create new representations of model input, typically for a convolutional neural network, while simultaneously learning how to classify those inputs.
entangled
If the noise vector of a generative adversarial network has too few dimensions, the dimensions become entangled so that a single dimension affects multiple aspects of the generated output. Controllable GANs use larger noise vectors to disentangle desired output features, to enable modification of those features by moving through noise space.
epoch
One pass through all of the available training data. Typically, training does not use all of the possible training data before updating the weights and biases of the network. Instead, a small subset of the data, a minibatch, is used. The ratio between the number of samples in the training data and the number used in a minibatch determines the number of gradient descent steps per epoch.
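The ratio described above amounts to a single division (illustrative numbers; the helper name is made up):

```python
def steps_per_epoch(num_training_samples, minibatch_size):
    # Gradient descent steps taken in one pass through the training data
    return num_training_samples // minibatch_size

steps = steps_per_epoch(60_000, 100)  # 600 steps per epoch
```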
evolutionary algorithm
A kind of optimization algorithm that is generally applicable to a wide range of optimization problems. Evolutionary algorithms mimic some aspects of biological evolution to move toward better and better solutions to the problem.
explainable AI
Neural networks are black boxes that cannot easily explain why they do what they do. Explainable AI is a movement to understand the reasons behind neural network output. The advent of large language models with in-context learning abilities might be a boon to explainable AI, as LLMs seem capable of explaining their reasoning processes.
false negative
A sample of class 1 assigned to class 0 by a model. Class 0 is the negative class in a two-class (binary) classifier.
false positive
A sample of class 0 assigned to class 1 by a model. Class 1 is the positive class in a two-class (binary) classifier.
feature
An element of the feature vector input to a model. Features are data elements that have some relevance to determining the proper class label for an input. If the input is an image, each image pixel is a feature. Other possible features include measurements, location information, color, or any quantity (numeric) that can help a model learn to produce correct output.
feature vector
A collection of features as a multidimensional vector. Historically, a feature vector is the input a model uses to produce an output value, either a number (regression) or a class label (classification).
filter
In a convolutional neural network, a filter is a collection of kernels learned to map a stack of inputs, the output of the previous layer, to a new stack of outputs passed to the next layer of the model.
forward pass
During neural network training, the forward pass pushes training data through the network to accumulate outputs. The errors made by the network, as calculated during the forward pass, are used during the backward pass to update the model’s parameters.
generative adversarial network (GAN)
A neural network consisting of two parts: a generator and a discriminator. During training, the generator attempts to learn how to fool the discriminator, which is trying to become better and better at telling the difference between real and fake inputs. Once trained, the discriminator is typically discarded, and the generator is used to produce novel outputs mimicking the real training samples.
generative AI
A catch-all term for models that produce novel output from either pure random inputs or random inputs guided by user-supplied prompts to tailor the generated output. Generative adversarial networks, diffusion models, and large language models are all types of generative AI.
generative pretrained transformer (GPT)
A neural network based on the transformer architecture that has been pretrained to predict the next token when given an initial text prompt (that is, a large language model). The GPT models built and trained by OpenAI are among the first neural network models to exhibit emergent properties. These models have dramatically altered the AI landscape, and their unexpected emergent capabilities represent a paradigm shift that will profoundly affect the world as we know it.
generator
The part of a generative adversarial network that produces fake output from a noise vector input. Most GANs seek to train the generator for later use.
genetic programming (GP)
Using evolutionary algorithms to generate computer code to solve a particular problem. The coding abilities of large language models like GPT-4 far exceed the limited successes of genetic programming. However, GP still has a place for specific use cases, like evolving functions to fit data (as opposed to curve fitting, which finds parameter values for a known functional form).
global minimum
The lowest point in a function. Neural network training seeks, ideally, the global minimum of the error function, with suitable caveats about generalization to new inputs.
gradient descent
The algorithm used to train neural networks, from simple traditional models to behemoths like GPT-4. Gradient descent adjusts the model’s parameters (weights and biases) to minimize the error over the training data. Mathematically, gradient descent is a first-order algorithm (think the slope of a curve at a point), and by conventional wisdom it should not work for the complex error surfaces of neural networks. That it does is a bit of a mystery and a happy accident. One belief is that gradient descent tends to fall into local minima, but the local minima are generally good enough for practical purposes.
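A minimal sketch of the idea, minimizing the toy function f(x) = x squared, whose gradient is 2x, as a stand-in for a network's error function:

```python
def gradient_descent(gradient, start, learning_rate, steps):
    # Repeatedly step downhill: x <- x - learning_rate * gradient(x)
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

# The minimum of f(x) = x**2 is at x = 0; start far away and walk toward it
x_min = gradient_descent(lambda x: 2 * x, start=5.0, learning_rate=0.1, steps=100)
# x_min is now very close to 0
```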
hallucination
A generic term for when models create output that isn’t expected or shouldn’t be there. Advanced generative adversarial networks can “hallucinate” to create output objects that do not exist when adjusting inputs. Currently, the term is most often used when large language models produce output text that is not factually correct; for example, when the model has put in a fact because a fact should be in that part of its response, even if it does not know the fact that should be there. Hallucination in large language models is a real cause for concern and an active research area.
hidden layer
Any layer in a neural network that isn’t the input or output layer.
hyperparameters
Neural networks have weights and biases, the parameters that training modifies to teach the network. The training process has its own set of parameters. These are the hyperparameters. For example, the learning rate (the gradient descent step size) and the minibatch size are hyperparameters. Modifying the hyperparameters affects how well the model learns, though the hyperparameters are not part of the model.
in-context learning
The emergent ability of large language models like GPT-4 to learn on the fly without modifying their weights. Precisely how in-context learning happens is not entirely understood as of this writing.
inference
The name given to using a trained model to make predictions for unknown inputs.
kernel
A small, usually square, array of numbers used in a convolution operation. Convolutional neural networks learn banks of kernels to transform the input into a new representation that is easier to classify. Convolving a kernel over an image is a classic digital image processing trick co-opted by CNNs to reveal structure that is useful for classification.
large language model (LLM)
A large neural network trained to predict token after token (often a word) when given a text prompt. Bard and GPT-4 are examples. Sufficiently complex LLMs have demonstrated emergent abilities far beyond what was expected of them, to the point that many are predicting world-changing effects akin to those produced by the Industrial Revolution. It’s difficult not to believe that thought and reasoning are happening when conversing with an LLM.
leaky ReLU
A modified rectified linear unit activation that multiplies negative inputs by a small value instead of clipping them to zero.
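As a sketch (alpha = 0.01 is a common, but not universal, choice for the small value):

```python
def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by a small alpha instead of clipped to zero
    return x if x > 0 else alpha * x
```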
learning rate
A scale factor multiplying weight and bias partial derivative values to determine the step size during gradient descent. The learning rate might be fixed, or decrease during training under the assumption that smaller steps are needed to zero in on the minimum of the error function.
local minimum
A low point of a function surrounded by higher values, like a valley. The lowest local minimum is the global minimum of the function. Optimization problems, including neural network training, seek minima, often the global minimum.
loss
The name given to the error a neural network makes on a subset (minibatch) of the training data during the forward pass. The goal of training is to adjust the weights and biases to minimize the loss over the training set.
machine learning
Machine learning conditions models like random forests, support vector machines, and neural networks to a particular dataset so that the conditioned model can accurately predict class labels or numeric values when given new, unknown inputs.
manifold
A mathematical concept describing a reduced dimensional space existing in a higher-dimensional space. For example, a wavy, two-dimensional sheet in three dimensions is a manifold. It is believed, with good reason, that most complex datasets exist primarily on a manifold in the high-dimensional space in which the dataset is presented to models.
metric
A measurement. In machine learning and AI in general, a metric is anything used to help evaluate the performance of a model. There is a formal mathematical definition as well, which can be taken as a distance measure of some kind, like the Euclidean (straight line) distance or the Manhattan distance (which measures along a grid, like city blocks).
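Both distance metrics mentioned can be sketched in a few lines (function names are illustrative):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Distance measured along a grid, like walking city blocks
    return sum(abs(a - b) for a, b in zip(p, q))

d1 = euclidean((0, 0), (3, 4))  # 5.0, as the crow flies
d2 = manhattan((0, 0), (3, 4))  # 7, by city blocks
```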
minibatch
A randomly selected subset of the available training set used to take a gradient descent step during neural network training. The error determined by a minibatch is likely an imperfect estimate of the true error surface gradient. Because of this, the word “stochastic” is placed in front of “gradient descent” when using minibatches. Training with minibatches often leads to better-performing models, compared to using a large amount of training data for each gradient descent step. This is a happy accident because gradient descent with minibatches greatly reduces the computational burden encountered in neural network training.
模式崩溃
mode collapse
指生成对抗网络的生成器在训练早期就学会产生某个特别能骗过鉴别器的输出,导致生成器极度偏爱该输出,有时甚至排斥所有其他输出。
When the generator of a generative adversarial network learns early on during training to produce a particularly effective output that fools the discriminator, causing the generator to highly favor that output, sometimes to the exclusion of all others.
模型
model
任何通过调整自身参数来适应一组数据的算法的通用术语。模型可以是神经网络,也可以是任何其他机器学习算法,例如随机森林或支持向量机。更抽象地说,模型是“对复杂情况的有意简化,旨在去除无关的细节,以便专注于本质”(Daniel L. Hartl,《群体遗传学与基因组学入门》[牛津大学出版社,2020])。
A generic term for any algorithm conditioned to a set of data by adjusting the parameters of the algorithm. A model might be a neural network or any other machine learning algorithm, like a random forest or a support vector machine. More abstractly, a model is “an intentional simplification of a complex situation designed to eliminate extraneous detail in order to focus on the essentials” (Daniel L. Hartl, A Primer of Population Genetics and Genomics [Oxford University Press, 2020]).
多层感知器 (MLP)
multilayer perceptron (MLP)
由完全连接的前馈层构建的传统神经网络的有点老式的名称。 “感知机”部分让人回想起 Frank Rosenblatt 20 世纪 50 年代的感知机。
A somewhat old-fashioned name for a traditional neural network constructed from fully connected feedforward layers. The “perceptron” portion harkens back to Frank Rosenblatt’s Perceptron machine from the 1950s.
最近邻
nearest neighbor
最简单的机器学习模型,其中训练集本身就是模型。新实例会被分配最近的训练集样本的类标签,或在投票时分配最近 k 个样本的类标签(平局时随机选出获胜者)。
The simplest of machine learning models, where the training set is the model. New instances are assigned the class label of the nearest training set example, or of the nearest k samples when voting (with the winner selected randomly in case of a tie).
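一个最小的 1-最近邻示意(训练样本是假设的):
A minimal 1-nearest-neighbor sketch (the training samples are hypothetical):

```python
def nearest_neighbor(train, query):
    # train is a list of (feature_vector, label); the training set *is* the model.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Assign the class label of the closest training example.
    return min(train, key=lambda pair: sq_dist(pair[0], query))[1]

train = [((0.0, 0.0), "A"), ((1.0, 1.0), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor(train, (4.5, 4.8)))  # B
```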
神经网络
neural network
根据某种架构排列的神经元(节点)集合,其中输入逐层映射到输出。神经网络是现代人工智能的基础。从历史上看,神经网络作为联结主义的一种表达方式,被认为是不成功的,而且作用不大。深度学习革命证明了事实并非如此。
A collection of neurons (nodes) arranged according to some architecture where an input is mapped, layer by layer, to an output. Neural networks are the foundation of modern artificial intelligence. Historically, neural networks, an expression of connectionism, were regarded as unsuccessful and only marginally useful. The deep learning revolution proved otherwise.
神经元
neuron
神经网络的基本单位,因其表面上与生物神经元的相似性而得名。请参阅节点。
The fundamental unit of a neural network, named for its superficial similarity to biological neurons. See node.
节点
node
神经网络的基本单元。节点接受多个输入,将其与权重相乘,再与一个偏置值一起求和。所得数字被传递给激活函数,产生该节点的输出值。训练会结合网络中的其他节点和训练数据集,为该节点找到合适的权重和偏置值。
The fundamental unit of a neural network. Nodes accept multiple inputs, multiplied by weights, that are summed along with a bias value. The resulting number is passed to an activation function to produce the node’s output value. Training locates the weight and bias values appropriate for the node in relation to other nodes in the network and the training dataset.
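单个节点的计算可以示意如下(权重、偏置和激活函数的选择均为假设):
A single node's computation can be sketched as follows (the weights, bias, and choice of activation are illustrative assumptions):

```python
import math

def node(inputs, weights, bias):
    # Multiply each input by its weight, sum, and add the bias value.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the result through an activation function (sigmoid here).
    return 1.0 / (1.0 + math.exp(-z))

# 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0, and sigmoid(0) = 0.5
print(node([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```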
噪声向量
noise vector
在生成对抗网络中,噪声向量是从正态分布中随机抽取的数字集合,通常范围在 10 到 100 左右。噪声向量决定了网络生成器部分创建的输出(图像)。
In a generative adversarial network, the noise vector is a collection of numbers, typically in the range of 10 to 100 or so, that are drawn randomly from a normal distribution. The noise vector determines the output (image) created by the generator portion of the network.
独热编码
one-hot encoding
表示许多模型所需的类标签的另一种方法。one-hot 编码是一个元素个数与类别数相同的向量。通过将 one-hot 向量中对应类别的元素设为 1、其余元素全部设为 0 来指定类别。
An alternative way to represent the class labels required by many models. A one-hot encoding is a vector with as many elements as there are classes. A class is specified by setting that element of the one-hot vector to one while setting all other elements to zero.
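下面是一个简单示意(类别数是假设的):
A simple sketch (the number of classes is hypothetical):

```python
def one_hot(class_index, num_classes):
    # A vector with one element per class; the labeled class gets a 1,
    # and every other element is zero.
    v = [0] * num_classes
    v[class_index] = 1
    return v

print(one_hot(2, 4))  # [0, 0, 1, 0]
```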
一对一
one-versus-one
一种将支持向量机扩展到多类分类任务的方法。 SVM 针对每对类标签进行训练。如果有n个类,此方法需要训练n ( n – 1)/2 个 SVM。
An approach to extending a support vector machine to a multiclass classification task. An SVM is trained for each pairing of class labels. If there are n classes, this approach requires training n(n – 1)/2 SVMs.
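配对数 n(n – 1)/2 可以直接列举验证(类别名是假设的):
The n(n – 1)/2 pair count can be verified by enumeration (the class names are hypothetical):

```python
from itertools import combinations

def ovo_pairs(classes):
    # One SVM is trained per pair of class labels.
    return list(combinations(classes, 2))

pairs = ovo_pairs(["cat", "dog", "bird", "fish"])
print(len(pairs))  # n(n - 1)/2 = 4 * 3 / 2 = 6
```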
一对多
one-versus-rest
一种将支持向量机扩展到多类分类任务的方法。 SVM 通过将每个类与所有其他类的聚合进行比较来进行训练。如果有n 个类,则此方法需要n 个模型。
An approach to extending a support vector machine to a multiclass classification task. An SVM is trained by comparing each class with the aggregation of all the others. This approach requires n models if there are n classes.
过拟合
overfitting
学习训练集的微小细节,而没有学会泛化到新数据。过拟合是许多机器学习模型(尤其是决策树)的一个问题,但对大型神经网络来说,这个问题似乎没那么严重。
Learning the minute details of the training set without learning to generalize to new data. Overfitting is a problem with many machine learning models, especially decision trees, but it appears to be less of an issue with large neural networks.
参数
parameters
模型中任何可调整数量的通用术语。通常,术语“参数”用于统指神经网络的权重和偏差。
A generic term for any quantity in a model that can be adjusted. Usually, the term “parameters” is used to refer collectively to the weights and biases of a neural network.
池化层
pooling layer
一种神经网络层,常见于卷积神经网络等高级模型中。池化层没有可学习的参数(没有权重),而是通过选取小输入区域内的最大值或平均值来执行空间缩减(通常缩减为原来的一半)。池化的作用类似于卷积,但池化核彼此不重叠,因此 2×2 池化会使每个方向上的空间范围减半。
A kind of neural network layer often found in advanced models like convolutional neural networks. Pooling layers have no learnable parameters (no weights) but perform a spatial reduction, usually by a factor of 2, by selecting the maximum or average value in a small input region. Pooling acts like convolution, but the pooling kernel does not overlap, so a 2×2 pooling reduces the spatial extent by a factor of two in each direction.
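不重叠的 2×2 最大池化可以这样示意(输入图像是假设的):
Non-overlapping 2×2 max pooling can be sketched like this (the input image is hypothetical):

```python
def max_pool_2x2(image):
    # Non-overlapping 2x2 max pooling: halves each spatial dimension
    # by keeping the largest value in each 2x2 block.
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]

img = [[1, 3, 2, 0],
       [4, 2, 1, 5],
       [0, 1, 3, 2],
       [2, 6, 1, 1]]
print(max_pool_2x2(img))  # [[4, 5], [6, 3]]
```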
预处理
preprocessing
用于描述在使用数据集训练或使用任何类型的模型之前对数据集进行的任何操作的通用术语。例如,当输入特征范围相似且平均值接近零时,许多机器学习模型表现最佳。改变数据集以实现这种情况,称为“标准化”,是一个预处理步骤。图像的预处理步骤可能是将其转换为灰度或删除 Alpha 通道。预处理是构建数据集的重要组成部分。
The generic term used to describe any manipulation of a dataset before using that dataset to train or work with any kind of model. For example, many machine learning models perform best when the input feature ranges are similar and near an average of zero. Altering the dataset so that is the case, known as “standardizing,” is a preprocessing step. A preprocessing step for images might be to convert them to grayscale or remove an alpha channel. Preprocessing is an essential part of building datasets.
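上文提到的“标准化”这一预处理步骤可以粗略示意如下(特征值是假设的):
The "standardizing" preprocessing step mentioned above can be roughly sketched as follows (the feature values are hypothetical):

```python
def standardize(values):
    # Shift a feature to mean 0 and scale it to standard deviation 1.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

feature = [10.0, 12.0, 14.0, 16.0]
z = standardize(feature)
print(round(sum(z), 10))  # mean is now 0.0
```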
随机森林
random forest
使用可用特征的随机选择在袋装(重新采样)数据集上训练的决策树集合。森林中的每棵树都会对新的特征向量进行分类,然后在整个森林中对结果进行投票以得出最终输出。
A collection of decision trees trained on bagged (resampled) datasets using random selections of the available features. Each tree in the forest classifies new feature vectors, and the result is voted upon across the forest to arrive at the final output.
修正线性单元 (ReLU)
rectified linear unit (ReLU)
深度学习中广泛使用的激活函数。如果输入为正,则输出就是输入。否则,输出为零。
An activation function widely used in deep learning. If the input is positive, the output is the input. Otherwise, the output is zero.
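这个定义直接对应一行代码:
The definition maps directly to a single line of code:

```python
def relu(x):
    # Positive inputs pass through unchanged; everything else becomes zero.
    return x if x > 0 else 0

print([relu(v) for v in (-2, -0.5, 0, 1.5, 3)])  # [0, 0, 0, 1.5, 3]
```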
循环神经网络(RNN)
recurrent neural network (RNN)
一种将自身输出反馈作为输入的神经网络。RNN 具有历史意义,但难以训练。它们在处理时间序列输入时,将前一个词元的输出与下一个词元一起作为输入。RNN 有多种变体,但都只有短期记忆,这使它们不适合需要长期关联的任务。参见 Transformer。
A kind of neural network that feeds its output back in as an input. RNNs are historically relevant but difficult to train. They process time series inputs using the output from the previous token as an input along with the next token. Varieties of RNNs exist, but all have short-term memories, making them unsuitable for tasks requiring long-term associations. See transformer.
正则化器
regularizer
任何引导或推动神经网络训练的手段,使网络学习训练集中能够推广到新的未知输入的特征,而不是专注于无法泛化的微小细节。数据增强是一种正则化器,在损失函数中添加某些项也是。
Anything that directs or nudges the training of a neural network so that it learns the characteristics of the training set that extend to new, unknown inputs instead of focusing on minute details that do not generalize. Data augmentation is a regularizer, as is adding certain terms to the loss function.
根据人类反馈进行强化学习 (RLHF)
reinforcement learning from human feedback (RLHF)
OpenAI 使用的一个人在回路步骤,用于帮助对齐 GPT 模型的输出,使其符合人类的期望和社会的要求。模型的输出由人工评审者打分,然后再反馈给模型以调适其输出。
A human-in-the-loop step used by OpenAI to help align the output of GPT models to reflect human expectations and social requirements. The model’s outputs are graded by human reviewers and then fed in again to condition the output.
调度
schedule
在扩散模型中,“调度”是指在训练图像中添加噪声,或者在从随机噪声生成图像的逆过程中将噪声去除。
In diffusion models, “schedule” refers to adding noise to a training image or removing it during the reverse process when generating an image from random noise.
语义分割
semantic segmentation
分类器通常输出类标签。一些分类器输出边界框来定位图像中识别的对象。语义分割将输入的每个像素分配给一个类,从而可以轻松地分割对象。
Classifiers typically output a class label. Some classifiers output bounding boxes to localize an object identified in an image. Semantic segmentation assigns every pixel of the input to a class, thereby allowing easy segmentation of objects.
sigmoid 函数
sigmoid
一种激活函数,其曲线呈 S 形,在 x = 0 处取值 0.5,从负无穷处的 0 一直增长到正无穷处的 1(见图 1)。由于这种压缩到 0 到 1 的范围,sigmoid 函数(也称为逻辑函数)常用于二分类神经网络的输出层,以表示一个类似概率的值:值越接近 1,表示网络越倾向于认为输入是正类(或目标类)的实例。sigmoid 的多值版本称为 softmax。
An activation function that produces an S-shaped curve with a value of 0.5 at x = 0 and running from 0 at negative infinity to 1 at positive infinity (see Figure 1). Because of this compressed 0 to 1 range, sigmoid functions (also known as logistic functions) are often used in the output layer of a binary neural network to represent a probability-like value, with a value closer to 1 implying a stronger belief by the network that the input is an instance of the positive (or target) class. The multivalued version of the sigmoid is known as the softmax.
图 1:Sigmoid 函数
Figure 1: The sigmoid function
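图 1 中的曲线只需一行代码即可计算:
The curve in Figure 1 takes only a line of code to compute:

```python
import math

def sigmoid(x):
    # S-shaped squash: approaches 0 at negative infinity,
    # is 0.5 at x = 0, and approaches 1 at positive infinity.
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))             # 0.5
print(round(sigmoid(6.0), 3))   # 0.998 (close to 1)
print(round(sigmoid(-6.0), 3))  # 0.002 (close to 0)
```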
随机梯度下降
stochastic gradient descent
梯度下降利用微积分,沿误差函数的斜率(梯度)向下走向最小值。斜率是根据模型在当前权重和偏置下对训练数据的误差估计出来的。随机梯度下降在估计斜率时并不使用全部训练数据,而是使用随机选择的子集,即小批量。这样做有两个原因:一是节省计算时间;二是这种带有随机误差的梯度(“随机”一词即意味着随机性)往往反而是更有利于避开局部最小值的估计。最终,随机梯度下降能产生性能更好的模型,这就足以成为使用它的理由。
Gradient descent uses calculus to follow the slope (gradient) of the error function down toward a minimum. The slope is estimated from the model’s error, with its current set of weights and biases, on the training data. Stochastic gradient descent does not use all of the training data when estimating the slope. Instead, it uses a randomly selected subset, a minibatch. This is done for two reasons: to save computation time and because the randomly wrong (the word “stochastic” implies randomness) gradient often seems to be a better estimate for avoiding local minima. Ultimately, stochastic gradient descent produces better-performing models, which is reason enough to use it.
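下面是一个极简的小批量随机梯度下降示意,在一个玩具问题上恢复直线的斜率(数据集、学习率和批大小均为假设):
A minimal minibatch stochastic gradient descent sketch, recovering the slope of a line on a toy problem (the dataset, learning rate, and batch size are illustrative assumptions):

```python
import random

random.seed(0)

# Toy dataset: y = 3x exactly; we recover the slope w by minibatch SGD.
data = [(i / 100.0, 3.0 * i / 100.0) for i in range(100)]

w = 0.0    # the single model parameter
lr = 0.1   # learning rate
for step in range(200):
    batch = random.sample(data, 8)  # a random minibatch, not the full training set
    # Gradient of the mean squared error with respect to w, estimated on the batch.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad  # step downhill along the (noisy) estimated slope

print(round(w, 2))  # 3.0
```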
支持向量机(SVM)
support vector machine (SVM)
一种在 20 世纪 90 年代和 2000 年代初流行的机器学习模型,因为它总体上很有效,而且可以在不需要神经网络那样的巨大计算成本的情况下进行训练。深度学习革命已经在很大程度上用神经网络取代了 SVM,但 SVM 在机器学习领域仍然占有一席之地。
A machine learning model popular in the 1990s and early 2000s because of its effectiveness overall and because it can be trained without the enormous computational cost encountered by neural networks. The deep learning revolution has largely replaced SVMs with neural networks, but SVMs still have a place at the machine learning table.
群体智能
swarm intelligence
一种基于个体智能体群体行为的通用优化形式。群体算法很流行而且往往很强大,特别是在优化那些无法用数学方法(即微积分)优化的事物时。实际上,粒子群优化等群体智能算法可以应用于许多与进化算法相同的场合。有些人认为群体智能和进化算法是人工智能的一种形式;我并不这样认为,尽管我经常使用两者。
A generic form of optimization based on the behavior of a swarm of individual agents. Swarm algorithms are popular and often powerful, especially when optimizing things that cannot be optimized mathematically (by using calculus). In practical terms, swarm intelligence algorithms, like particle swarm optimization, can be applied in many of the same situations as evolutionary algorithms. Some people consider swarm intelligence and evolutionary algorithms a form of AI. I do not, though I use both frequently.
测试
testing
在机器学习中,测试指的是用训练期间保留下来的数据集来检验新训练好的模型(无论其类型如何)。由于预期输出是已知的,测试会产生有助于评估模型的数据,例如混淆矩阵或由混淆矩阵导出的指标。
In machine learning, testing means using a newly trained model, regardless of type, with a dataset held back during training. Because the expected output is known, testing creates data helpful in evaluating the model, for example with a confusion matrix or metrics derived from the confusion matrix.
词元
token
大型语言模型将其文本提示解析为小块,这些小块可能是单个单词、单词的一部分或单个字符。这些就是词元。同样,LLM 在响应用户提供的提示时,会一个接一个地输出词元。
Large language models parse their text prompts into small pieces, which may be individual words, parts of words, or single characters. These are tokens. Likewise, LLMs output token after token when responding to user-supplied prompts.
训练
training
将模型参数调整为特定数据集或用例的行为。训练需要什么取决于模型的形式,从几乎没有(最近邻分类器)到令人难以置信的计算水平(训练像 GPT-4 这样的大型语言模型)。所有机器学习模型都从训练数据集中学习,使机器学习(包括深度学习)成为一种实证练习。如果训练数据好,模型就会好。如果训练数据较差或不完整,模型的性能也会很差。垃圾进垃圾出。
The act of conditioning the parameters of a model to a specific dataset or use case. What training entails depends on the form of the model, from virtually nothing (nearest neighbor classifiers) to incredible levels of computation (training a large language model like GPT-4). All machine learning models learn from the training dataset, making machine learning, including deep learning, an empirical exercise. If the training data is good, the model can be good. If the training data is poor or incomplete, the model’s performance will also be poor. Garbage in, garbage out.
Transformer
transformer
一种相对较新的神经网络架构,是 GPT-4 等大型语言模型的核心。(其中的“T”代表“transformer”。)Transformer 模型引入了注意力机制,可用于传统上使用循环神经网络的场合。具有大输入窗口的 Transformer(GPT-4 的窗口约为 30,000 个词元)可以将模型的注意力集中在窗口内的任何位置。
A relatively new neural network architecture at the heart of large language models like GPT-4. (The “T” means “transformer.”) Transformer models incorporate attention and can be used in situations where recurrent neural networks were traditionally used. Transformers with large input windows (GPT-4’s is some 30,000 tokens) can focus model attention anywhere inside the window.
真阴性
true negative
被模型分配到 0 类的 0 类样本。0 类是二分类器中的负类。
A sample of class 0 assigned to class 0 by the model. Class 0 is the negative class in a two-class classifier.
真阳性
true positive
被模型分配到 1 类的 1 类样本。1 类是二分类器中的正类。
A sample of class 1 assigned to class 1 by the model. Class 1 is the positive class in a two-class classifier.
权重
weight
与节点的特定输入相乘的单个数字(标量)。神经网络的特定权重(和偏差)将模型调整为特定的数据集;也就是说,它们是神经网络的参数。训练使用梯度下降,它使用反向传播来定位一组好的权重和偏差。最后,这都是关于权重和偏差的。
A single number (a scalar) that multiplies a particular input to a node. The specific weights (and biases) of a neural network condition the model to a particular dataset; that is, they are the parameters of the neural network. Training uses gradient descent, which uses backpropagation to locate a good set of weights and biases. In the end, it’s all about the weights and biases.
人工智能资源的数量非常多。我在这里只列出了一些,大部分是书籍,还有在线资源(它们的寿命通常有限)。我希望它们对您有所帮助。
The number of AI resources out there is legion. I’m only listing a few here, mostly books, but also online resources (which often have a limited lifespan). I hope you find them helpful.
关于人工智能的一般书籍包括:
General books about AI include:
迈克尔·伍德里奇 (Michael Wooldridge) 的《人工智能简史》(Flatiron Books,2021 年)
A Brief History of Artificial Intelligence by Michael Wooldridge (Flatiron Books, 2021)
对我在第二章中介绍的那段历史更为详尽、更为平衡的叙述;正如我在那里所说,我的介绍必然有所偏颇。
A more thorough and balanced account of the history I presented in Chapter 2, which, as I stated there, was necessarily biased.
《这可能很重要:我与人工智能界共处的生活与时代》,作者:Pamela McCorduck(Lulu Press,2019 年)
This Could Be Important: My Life and Times with the Artificial Intelligentsia by Pamela McCorduck (Lulu Press, 2019)
关于人工智能发展的另一个个人叙述。
Another, personal, account of the development of AI.
《你看起来像个东西,而且我爱你:人工智能如何运作,以及它为何让世界变得更奇怪》,作者:Janelle Shane(Voracious,2019 年)
You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It’s Making the World a Weirder Place by Janelle Shane (Voracious, 2019)
对于本书中讨论的许多主题的另一种看法。
For an alternate take on many of the topics discussed in this book.
《深度学习:视觉方法》,作者:Andrew Glassner(No Starch Press,2021 年)
Deep Learning: A Visual Approach by Andrew Glassner (No Starch Press, 2021)
一本通用的、主要是视觉的书,更详细地涵盖了许多主题,但仍然没有数学负担。
A general, primarily visual book that covers many topics in more detail, yet still without the burden of mathematics.
如果您准备好投入人工智能领域,那么以下书籍都是合乎逻辑的后续步骤:
If you’re ready to take the plunge into AI proper, the following books are all logical next steps:
《Python 深度学习》,第二版,作者:François Chollet(曼宁,2021 年)
Deep Learning with Python, 2nd edition, by François Chollet (Manning, 2021)
由 Keras 的创建者编写,Keras 是一种流行的基于 Python 的工具,可以使构建神经网络变得更加简单。
Written by the creator of Keras, a popular Python-based tool that makes building neural networks vastly simpler.
《深度学习的数学:理解神经网络需要了解什么》,作者:Ronald T. Kneusel(No Starch Press,2021 年)
Math for Deep Learning: What You Need to Know to Understand Neural Networks by Ronald T. Kneusel (No Starch Press, 2021)
本书有意回避数学;《深度学习的数学》则相反,让你为现代人工智能中的数学做好准备。
This book intentionally avoided mathematics; Math for Deep Learning does the opposite, preparing you for the mathematics found in modern AI.
《实用深度学习:基于 Python 的介绍》,作者:Ronald T. Kneusel(No Starch Press,2021 年)
Practical Deep Learning: A Python-Based Introduction by Ronald T. Kneusel (No Starch Press, 2021)
从这里开始使用人工智能。
Start here to begin using AI.
《深度学习基础知识:设计下一代机器智能算法》,第二版,作者:Nithin Buduma 等人(O'Reilly,2022 年)
Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, 2nd edition, by Nithin Buduma et al. (O’Reilly, 2022)
本书涵盖了《实用深度学习》之外的更多主题。
This book covers additional topics beyond Practical Deep Learning.
存在许多与人工智能相关的在线资源。以下是一些您可能会觉得有用的内容:
Many online resources related to AI exist. Here are a few you might find useful:
神经网络和深度学习( http://www.neuralnetworksanddeeplearning.com )
Neural Networks and Deep Learning (http://www.neuralnetworksanddeeplearning.com)
迈克尔·尼尔森 (Michael Nielsen) 的免费在线书籍。非常值得一看。
A free online book by Michael Nielsen. Well worth a look.
Coursera 机器学习专业化 ( https://www.coursera.org/specializations/machine-learning-introduction )
Coursera Machine Learning Specialization (https://www.coursera.org/specializations/machine-learning-introduction)
Coursera 最初是一门在线机器学习课程。这个专业课程可以免费旁听,涵盖了您所需的一切。
Coursera started as an online machine learning course. This specialization, which you can audit for free, covers everything you need.
“图解 GPT-2”(https://jalammar.github.io/illustrated-gpt2)
“The Illustrated GPT-2” (https://jalammar.github.io/illustrated-gpt2)
一篇非常好的文章,详细介绍了大型语言模型的工作原理,并配有动画。
A very nice post detailing how large language models work, complete with animations.
人工智能解释(https://www.youtube.com/@aiexplained-official)
AI Explained (https://www.youtube.com/@aiexplained-official)
一个 YouTube 频道,以深思熟虑且清晰的方式呈现最新新闻。如果您想了解人工智能领域正在发生什么,这是一个很好的起点。
A YouTube channel with up-to-the-minute news thoughtfully and clearly presented. If you want to know what’s happening in AI, this is a good place to start.
计算机爱好者 ( https://www.youtube.com/@Computerphile )
Computerphile (https://www.youtube.com/@Computerphile)
诺丁汉大学的一个经典 YouTube 频道,讨论计算机的所有事物,包括人工智能。
A classic YouTube channel from the University of Nottingham that discusses all things computer, including AI.
莱克斯·弗里德曼播客 ( https://www.youtube.com/@lexfridman )
Lex Fridman Podcast (https://www.youtube.com/@lexfridman)
弗里德曼是麻省理工学院的教授,经常采访人工智能领域的领导者。
Fridman is a professor at MIT and frequently interviews leaders in AI.
准确度, 58
accuracy, 58
激活函数,62
activation function, 62
亚历克斯网,34
AlexNet, 34
算法, 5
algorithm, 5
对齐,132
alignment, 132
架构,69
architecture, 69
人工智能,5,29
artificial intelligence, 5, 29
算法, 8
algorithm, 8
通用人工智能,116
artificial general intelligence, 116
狭义人工智能,116
artificial narrow intelligence, 116
模型,7
model, 7
狭义与一般,116
narrow vs. general, 116
监督学习,8
supervised learning, 8
阿西洛玛有益人工智能会议,36
Asilomar Conference on Beneficial AI, 36
罗文·阿特金森,25
Atkinson, Rowan, 25
注意,131
attention, 131
自动微分,95
automatic differentiation, 95
自动机器学习,94
AutoML, 94
艾尔斯,约翰·W.,147
Ayers, John W., 147
反向传递,69
backward pass, 69
批量归一化,40
batch normalization, 40
希莱尔·贝洛克,19
Belloc, Hilaire, 19
约书亚·本吉奥,33
Bengio, Yoshua, 33
偏置,63
bias, 63
二元模型,9
binary model, 9
丹尼尔·博伊科,148
Boiko, Daniil A., 148
乔治·布尔,27
Boole, George, 27
布尔代数,27
Boolean algebra, 27
尼克·博斯特罗姆,150
Bostrom, Nick, 150
莱昂·博图,33
Bottou, Léon, 33
边界框,93
bounding box, 93
乔治·博克斯,8
Box, George, 8
利奥·布雷曼,33
Breiman, Leo, 33
塞巴斯蒂安·布贝克,148
Bubeck, Sébastien, 148
卡罗尔·刘易斯,147
Carroll, Lewis, 147
中央处理器(CPU),38
central processing unit (CPU), 38
程尤金妮亚,151
Cheng, Eugenia, 151
崔艺珍,141
Choi, Yejin, 141
分类器,12
classifier, 12
计算图,95
computational graph, 95
条件 GAN,103
conditional GAN, 103
准确度, 58
accuracy, 58
马修斯相关系数,58
Matthews correlation coefficient, 58
联结主义,26
connectionism, 26
上下文编码,131
context encoding, 131
上下文窗口,132
context window, 132
卷积,85
convolution, 85
卷积层,87
convolutional layer, 87
卷积神经网络(CNN),30,33,86
convolutional neural network (CNN), 30, 33, 86
亚历克斯网,34
AlexNet, 34
有效感受野,90
effective receptive field, 90
嵌入,92
embedding, 92
端到端学习,81
end-to-end learning, 81
过滤器,88
filter, 88
勒内特,87
LeNet, 87
池化层,87
pooling layer, 87
相关性,20
correlation, 20
科琳娜·科尔特斯,32
Cortes, Corinna, 32
科弗,托马斯,30
Cover, Thomas, 30
CPU(中央处理器),38
CPU (central processing unit), 38
维数诅咒,46
curse of dimensionality, 46
曲线拟合,17
curve fitting, 17
达特茅斯研讨会,29
Dartmouth workshop, 29
数据
data
特点, 10
features, 10
特征空间,11
feature space, 11
特征向量,10
feature vector, 10
困难负样本,21
hard negatives, 21
标记, 8
labeled, 8
样品,10
sample, 10
测试集,10
test set, 10
训练集,10
training set, 10
数据增强,70
data augmentation, 70
数据集
dataset
恐龙脚印,56
dinosaur footprints, 56
MNIST, 14, 33, 45, 82, 100
解码器,130
decoder, 130
深蓝,32
Deep Blue, 32
算法, 8
algorithm, 8
工具包, 94
toolkits, 94
密集层,86
dense layer, 86
扩散模型,107
diffusion model, 107
有条件,109
conditional, 109
错误,111
errors, 111
固定伪随机数生成器种子,111
fixed pseudorandom number generator seed, 111
前向算法,109
forward algorithm, 109
逆向算法,109
reverse algorithm, 109
稳定扩散,110
Stable Diffusion, 110
恐龙脚印,56
dinosaur footprints, 56
鉴别器,97
discriminator, 97
分布, 18
distribution, 18
随机失活(dropout),40
dropout, 40
对偶数,95
dual numbers, 95
理查德·杜达,30
Duda, Richard, 30
有效感受野,90
effective receptive field, 90
嵌入,92
embedding, 92
嵌入空间,130
embedding space, 130
涌现,135
emergence, 135
端到端学习,81
end-to-end learning, 81
纠缠,105
entanglement, 105
纪元,76
epoch, 76
欧几里得,125
Euclid, 125
进化算法,59
evolutionary algorithm, 59
专家系统,31
expert system, 31
可解释的人工智能,22
explainable AI, 22
彼得·L·法金厄姆,56
Falkingham, Peter L., 56
特点, 10
features, 10
特征空间,11
feature space, 11
特征向量,10
feature vector, 10
过滤器,88
filter, 88
向前传球,69
forward pass, 69
弗里德曼,莱克斯,150
Fridman, Lex, 150
概括,11
generalization, 11
生成对抗网络,34,97
generative adversarial network, 34, 97
有条件,103
conditional, 103
可控,104
controllable, 104
鉴别器,97
discriminator, 97
纠缠,105
entanglement, 105
发电机,97
generator, 97
模式崩溃,103
mode collapse, 103
噪声向量,100
noise vector, 100
训练算法,98
training algorithm, 98
生成式人工智能,34
generative AI, 34
生成预训练变压器(GPT),130
generative pretrained transformer (GPT), 130
发电机,97
generator, 97
遗传编程,59
genetic programming, 59
全局最小值,72
global minimum, 72
加布·戈麦斯,148
Gomes, Gabe, 148
伊恩·古德费洛,34 岁
Goodfellow, Ian, 34
全局最小值与局部最小值,72
global vs. local minima, 72
学习率,75
learning rate, 75
图形处理单元 (GPU), 38
graphics processing unit (GPU), 38
帕特里克·哈夫纳,33
Haffner, Patrick, 33
彼得·哈特,30
Hart, Peter, 30
隐藏层,64
hidden layer, 64
杰弗里·辛顿,32
Hinton, Geoffrey, 32
约翰·霍普菲尔德,31
Hopfield, John, 31
超参数,55
hyperparameters, 55
情境学习,132
in-context learning, 132
初始化,67
initialization, 67
丹尼尔·卡尼曼,121
Kahneman, Daniel, 121
加里·卡斯帕罗夫,32
Kasparov, Garry, 32
萨尔·汗,146
Khan, Sal, 146
米哈尔·科辛斯基,119
Kosinski, Michal, 119
延斯·N·拉伦萨克,56
Lallensack, Jens N., 56
大语言模型(LLM),116
large language model (LLM), 116
对齐,132
alignment, 132
注意,131
attention, 131
上下文编码,131
context encoding, 131
上下文窗口,132
context window, 132
解码器,130
decoder, 130
嵌入空间,130
embedding space, 130
出现, 135
emergence, 135
情境学习,132
in-context learning, 132
预训练,130
pretraining, 130
强化学习,34
reinforcement learning, 34
来自人类反馈 (RLHF),132
from human feedback (RLHF), 132
词元,130
token, 130
Transformer,130
transformer, 130
Leaky ReLU,101
leaky ReLU, 101
学习率,75
learning rate, 75
李彼得,148
Lee, Peter, 148
戈特弗里德·莱布尼茨,27
Leibniz, Gottfried, 27
勒内特,87
LeNet, 87
詹姆斯·莱特希尔,30
Lighthill, James, 30
局部最小值,73
local minimum, 73
损失,70
loss, 70
算法, 8
algorithm, 8
罗伯特·麦克奈特,148
MacKnight, Robert, 148
流形,47
manifold, 47
矩阵,9
matrix, 9
马修斯相关系数,58
Matthews correlation coefficient, 58
帕梅拉·麦考达克,26
McCorduck, Pamela, 26
机械土耳其人,26
Mechanical Turk, 26
度量,56
metric, 56
小批量,76
minibatch, 76
MNIST, 14, 33, 45, 82, 100
模式崩溃,103
mode collapse, 103
模型,7
model, 7
二进制,9
binary, 9
分类器,12
classifier, 12
专家系统,31
expert system, 31
概括,11
generalization, 11
超参数,55
hyperparameters, 55
参数, 8
parameters, 8
支持向量机(SVM),32,52
support vector machine (SVM), 32, 52
培训, 8
training, 8
多层感知器,78
multilayer perceptron, 78
约翰·纳什,29
Nash, John, 29
新认知机,31
Neocognitron, 31
神经网络,26
neural network, 26
激活函数,62
activation function, 62
亚历克斯网,34
AlexNet, 34
偏置,63
bias, 63
深度学习,34
deep learning, 34
密集层,86
dense layer, 86
生成对抗,34,97
generative adversarial, 34, 97
生成式人工智能,34
generative AI, 34
隐藏层,64
hidden layer, 64
初始化,67
initialization, 67
损失,70
loss, 70
神经元,62
neuron, 62
节点,63
node, 63
预处理,66
preprocessing, 66
循环,130
recurrent, 130
正则化器,70
regularizer, 70
强化学习,34
reinforcement learning, 34
训练算法,69
training algorithm, 69
无监督学习,34
unsupervised learning, 34
重量,63
weight, 63
神经元,62
neuron, 62
诺亚,特雷弗,148
Noah, Trevor, 148
节点,63
node, 63
噪声向量,100
noise vector, 100
正态分布,99
normal distribution, 99
归一化,40
normalization, 40
朱利安·奥弗里·德拉·梅特里,27
Offray de La Mettrie, Julien, 27
独热编码,103
one-hot encoding, 103
一对一,56
one-versus-one, 56
一对多,56
one-versus-rest, 56
乔·潘托利亚诺,149
Pantoliano, Joe, 149
回形针灾难,150
paper clip catastrophe, 150
西摩·帕普特,30
Papert, Seymour, 30
参数, 8
parameters, 8
约瑟夫·佩特罗,148
Petro, Joseph, 148
沃尔特·皮茨,28
Pitts, Walter, 28
池化层,87
pooling layer, 87
预处理,66
preprocessing, 66
预训练,130
pretraining, 130
初级视觉皮层,84
primary visual cortex, 84
提示工程师,111
prompt engineer, 111
修正线性单元(ReLU),39
rectified linear unit (ReLU), 39
Leaky ReLU,101
leaky ReLU, 101
循环神经网络(RNN),130
recurrent neural network (RNN), 130
正则化器,70
regularizer, 70
强化学习,34
reinforcement learning, 34
来自人类反馈 (RLHF),132
from human feedback (RLHF), 132
安东尼·罗米利奥,56
Romilio, Anthony, 56
大卫·鲁梅尔哈特,32
Rumelhart, David, 32
斯图尔特·拉塞尔,150
Russell, Stuart, 150
样品,10
sample, 10
调度,107
schedule, 107
李世石,35
Sedol, Lee, 35
塞诺夫斯基,泰伦斯,140
Sejnowski, Terrence, 140
语义分割,93
semantic segmentation, 93
克劳德·香农,29
Shannon, Claude, 29
沙泽尔,诺姆,130
Shazeer, Noam, 130
玛丽·雪莱,27
Shelley, Mary, 27
sigmoid 函数,64
sigmoid, 64
雷·所罗门诺夫,29
Solomonoff, Ray, 29
“通用人工智能的火花”,116
“Sparks of Artificial General Intelligence,” 116
稳定扩散,110
Stable Diffusion, 110
随机梯度下降,76
stochastic gradient descent, 76
斯托克,大卫,30
Stork, David, 30
监督学习,8
supervised learning, 8
支持向量机(SVM),32,52
support vector machine (SVM), 32, 52
支持向量,54
support vector, 54
群体智能,59
swarm intelligence, 59
符号人工智能,26
symbolic AI, 26
伊恩·塔特索尔,37
Tattersall, Ian, 37
泰格马克,麦克斯,150
Tegmark, Max, 150
测试, 10
testing, 10
测试集,10
test set, 10
词元,130
token, 130
培训, 8
training, 8
训练集,10
training set, 10
Transformer,130
transformer, 130
注意,131
attention, 131
解码器,130
decoder, 130
编码器,130
encoder, 130
真阴性,57
true negative, 57
真阳性,57
true positive, 57
图灵机,28
Turing machine, 28
图灵测试,28
Turing test, 28
阿莫斯·特沃斯基,121
Tversky, Amos, 121
乌尔,伦纳德,30
Uhr, Leonard, 30
无监督学习,34
unsupervised learning, 34
文森特·梵高,83
van Gogh, Vincent, 83
弗拉基米尔·瓦普尼克,32
Vapnik, Vladimir, 32
瓦斯瓦尼,阿什什,130
Vaswani, Ashish, 130
向量,9
vector, 9
查尔斯·沃斯勒,30
Vossler, Charles, 30
权重,63
weight, 63
罗纳德·威廉姆斯,32
Williams, Ronald, 32
迈克尔·伍德里奇,26
Wooldridge, Michael, 26
莱特,利蒂西亚,148
Wright, Letitia, 148
《人工智能如何运作》(How AI Works)使用的字体有 New Baskerville、Futura、The Sans Mono Condensed 和 Dogma。本书由 Boris Veytsman 使用 LaTeX2ε 宏包 nostarch 排版,并由 Alex Freed 和 No Starch Press 团队的其他成员添加了许多内容(2023/04/08 v2.1 Typesetting books for No Starch Press)。
The fonts used in How AI Works are New Baskerville, Futura, The Sans Mono Condensed, and Dogma. The book was typeset with LATEX 2ɛ package nostarch by Boris Veytsman with many additions by Alex Freed and other members of the No Starch Press team (2023/04/08 v2.1 Typesetting books for No Starch Press).